6 Using Oracle Big Data SQL for Data Access

This chapter describes Oracle Big Data SQL. It contains the following topics:

6.1 What Is Oracle Big Data SQL?

Oracle Big Data SQL supports queries against vast amounts of big data stored in multiple data sources, including Hadoop. You can view and analyze data from various data stores together, as if it were all stored in an Oracle database.

Using Oracle Big Data SQL, you can query data stored in a Hadoop cluster using the complete SQL syntax. You can execute the most complex SQL SELECT statements against data in Hadoop, either manually or using your existing applications, to tease out the most significant insights. For example, users of the Oracle Advanced Analytics database option can apply their data mining models, which reside in Oracle Database, to data that is resident on Oracle Big Data Appliance.

The following sections provide further details:

6.1.1 About Oracle External Tables

Oracle Big Data SQL provides external tables with next generation performance gains. An external table is an Oracle Database object that identifies and describes the location of data outside of a database. You can query an external table using the same SQL SELECT syntax that you use for any other database tables.

External tables use access drivers to parse the data outside the database. Each type of external data requires a unique access driver. This release of Oracle Big Data SQL includes two access drivers for big data: one for accessing data stored in Apache Hive, and the other for accessing data stored in Hadoop Distributed File System (HDFS) files.

6.1.2 About the Access Drivers for Oracle Big Data SQL

By querying external tables, you can access data stored in HDFS and Hive tables as if that data was stored in tables in an Oracle database. Oracle Database accesses the data by using the metadata provided when the external table was created.

Oracle Database 12.1.0.2 supports two new access drivers for Oracle Big Data SQL:

  • ORACLE_HIVE: Enables you to create Oracle external tables over Apache Hive data sources. Use this access driver when you already have Hive tables defined for your HDFS data sources. ORACLE_HIVE can also access data stored in other locations, such as HBase, that have Hive tables defined for them.

  • ORACLE_HDFS: Enables you to create Oracle external tables directly over files stored in HDFS. This access driver uses Hive syntax to describe a data source, assigning default column names of COL_1, COL_2, and so forth. You do not need to create a Hive table manually as a separate step.

    Instead of acquiring the metadata from a Hive metadata store the way that ORACLE_HIVE does, the ORACLE_HDFS access driver acquires all of the necessary information from the access parameters. The ORACLE_HDFS access parameters are required to specify the metadata, and are stored as part of the external table definition in Oracle Database.

Oracle Big Data SQL uses these access drivers to optimize query performance.

6.1.3 About Smart Scan Technology

External tables do not have traditional indexes, so queries against them typically require a full table scan. However, Oracle Big Data SQL extends Smart Scan capabilities, such as filter-predicate offloads, to Oracle external tables with the installation of Exadata storage server software on Oracle Big Data Appliance. This technology enables Oracle Big Data Appliance to discard a huge portion of irrelevant data (up to 99 percent of the total) and return much smaller result sets to Oracle Exadata Database Machine. End users obtain the results of their queries significantly faster, as the direct result of a reduced load on Oracle Database and reduced traffic on the network.

See Also:

Oracle Database Concepts for a general introduction to external tables and pointers to more detailed information in the Oracle Database documentation library

6.1.4 About Data Security with Oracle Big Data SQL

Oracle Big Data Appliance already provides numerous security features to protect data stored in a CDH cluster on Oracle Big Data Appliance:

  • Kerberos authentication: Requires users and client software to provide credentials before accessing the cluster.

  • Apache Sentry authorization: Provides fine-grained, role-based authorization to data and metadata.

  • On-disk encryption: Protects the data on disk and at rest. For normal user access, the data is automatically decrypted.

  • Oracle Audit Vault and Database Firewall monitoring: The Audit Vault plug-in on Oracle Big Data Appliance collects audit and logging data from MapReduce, HDFS, and Oozie services. You can then use Audit Vault Server to monitor these services on Oracle Big Data Appliance.

Oracle Big Data SQL adds the full range of Oracle Database security features to this list. You can apply the same security policies and rules to your Hadoop data that you apply to your relational data.

6.2 Installing Oracle Big Data SQL

Oracle Big Data SQL is available only on Oracle Exadata Database Machine connected to Oracle Big Data Appliance. You must install the Oracle Big Data SQL software on both systems.

The following topics explain how to install Oracle Big Data SQL:

6.2.1 Prerequisites for Using Oracle Big Data SQL

Oracle Exadata Database Machine must comply with the following requirements:

  • Compute servers run Oracle Database and Oracle Enterprise Manager Grid Control 12.1.0.2 or later.

  • Storage servers run Exadata storage server software 12.1.1.1 or 12.1.1.0.

  • Oracle Exadata Database Machine is configured on the same InfiniBand subnet as Oracle Big Data Appliance.

  • Oracle Exadata Database Machine is connected to Oracle Big Data Appliance by the InfiniBand network.

6.2.2 Performing the Installation

Take these steps to install the Oracle Big Data SQL software on Oracle Big Data Appliance and Oracle Exadata Database Machine:

  1. Download My Oracle Support one-off patch 19377855 for RDBMS 12.1.0.2.

  2. On Oracle Exadata Database Machine, install patch 19377855 on:

    • Oracle Enterprise Manager Grid Control home

    • Oracle Database homes

    See the patch README for step-by-step instructions for installing the patch.

  3. On Oracle Big Data Appliance, install or upgrade the software to the latest version. See Oracle Big Data Appliance Owner's Guide.

    You can select Oracle Big Data SQL as an installation option when using the Oracle Big Data Appliance Configuration Generation Utility. See Oracle Big Data Appliance Owner's Guide.

  4. Download and install Mammoth patch 18064328 from Oracle Automated Release Updates.

  5. If Oracle Big Data SQL is not enabled during the installation, then use the bdacli utility:

    # bdacli enable big_data_sql
    

    See Oracle Big Data Appliance Owner's Guide.

  6. On Oracle Exadata Database Machine, run the post-installation script.

    See "Running the Post-Installation Script for Oracle Big Data SQL".

You can use Cloudera Manager to verify that Oracle Big Data SQL is up and running. See "Managing Oracle Big Data SQL".

6.2.3 Running the Post-Installation Script for Oracle Big Data SQL

To run the Oracle Big Data SQL post-installation script:

  1. On Oracle Exadata Database Machine, ensure that the Oracle Database listener is running and listening on an interprocess communication (IPC) interface.

  2. Verify the name of the Oracle installation owner. Typically, the oracle user owns the installation.

  3. Verify that the same user name (such as oracle) exists on Oracle Big Data Appliance.

  4. Download the bds-exa-install.sh installation script from the node where Mammoth is installed, typically the first node in the cluster. You can use a command such as wget or curl. This example copies the script from bda1node07:

    wget http://bda1node07/bda/bds-exa-install.sh
    
  5. As root, run the script and pass it the system identifier (SID). In this example, the SID is orcl:

    ./bds-exa-install.sh oracle_sid=orcl
    

    Note: If the Oracle installation owner is not oracle, then use the --install-user option. See "Running the bds-exa-install Script".

  6. Repeat step 5 for each database instance.

When the script completes, Oracle Big Data SQL is running on the database instance. However, if events cause the Oracle Big Data SQL agent to stop, then you must restart it. See "Starting and Stopping the Big Data SQL Agent".

6.2.3.1 Running the bds-exa-install Script

The bds-exa-install script generates a custom installation script that is run by the owner of the Oracle home directory. That secondary script installs all the files needed by Oracle Big Data SQL into the $ORACLE_HOME/bigdatasql directory. It also creates the database directory objects and the database links for the multithreaded Oracle Big Data SQL agent.

If the operating system user who owns Oracle home is not named oracle, then use the --install-user option to specify the owner.

Alternatively, you can use the --generate-only option to create the secondary script, and then run it as the owner of $ORACLE_HOME.

6.2.3.2 bds-exa-install Syntax

The following is the bds-exa-install syntax:

./bds-exa-install.sh oracle_sid=name [option]

The option names are preceded by two hyphens (--):

--generate-only={true | false}

Set to true to generate the secondary script without running it, or to false (the default) to generate and run it in one step.

--install-user=user_name

The operating system user who owns the Oracle Database installation. The default value is oracle.

6.3 Creating External Tables for Accessing Big Data

The SQL CREATE TABLE statement has a clause specifically for creating external tables. The information that you provide in this clause enables the access driver to read data from an external source and prepare the data for the external table.

6.3.1 About the Basic CREATE TABLE Syntax

The following is the basic syntax of the CREATE TABLE statement for external tables:

CREATE TABLE table_name (column_name datatype, 
                         column_name datatype[,...]) 
   ORGANIZATION EXTERNAL (external_table_clause);

You specify the column names and data types the same as for any other table. ORGANIZATION EXTERNAL identifies the table as an external table.

The external_table_clause identifies the access driver and provides the information that it needs to load the data. See "About the External Table Clause".
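
For example, the following minimal statement follows this pattern. It is only a sketch: the table name, column definitions, and HDFS path are illustrative, and complete ORACLE_HIVE and ORACLE_HDFS examples appear in the sections that follow.

CREATE TABLE sales_ext (sale_id  VARCHAR2(20),
                        amount   NUMBER(10,2))
   ORGANIZATION EXTERNAL
      (TYPE ORACLE_HDFS
       LOCATION ("hdfs:/user/sales/data/*"));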

6.3.2 Creating an Oracle External Table for Hive Data

You can easily create an Oracle external table for data in Apache Hive. Because the metadata is available to Oracle Database, you can query the data dictionary for information about Hive tables. Then you can use a PL/SQL function to generate a basic SQL CREATE TABLE ORGANIZATION EXTERNAL statement. You can then modify the statement before execution to customize the external table.

6.3.2.1 Obtaining Information About a Hive Table

The DBMS_HADOOP PL/SQL package contains a function named CREATE_EXTDDL_FOR_HIVE. It returns the data definition language (DDL) to create an external table for accessing a Hive table. This function requires you to provide basic information about the Hive table:

  • Name of the Hadoop cluster

  • Name of the Hive database

  • Name of the Hive table

  • Whether the Hive table is partitioned

You can obtain this information by querying the ALL_HIVE_TABLES data dictionary view. It displays information about all Hive tables that you can access from Oracle Database.

This example shows that the current user has access to an unpartitioned Hive table named RATINGS_HIVE_TABLE in the default database. A user named JDOE is the owner.

SQL> SELECT cluster_id, database_name, owner, table_name, partitioned FROM all_hive_tables;
 
CLUSTER_ID   DATABASE_NAME  OWNER    TABLE_NAME         PARTITIONED
------------ -------------- -------- ------------------ --------------
hadoop1      default        jdoe     ratings_hive_table  UN-PARTITIONED

You can query these data dictionary views to discover information about the Hive tables and columns that you can access from Oracle Database.

6.3.2.2 Using the CREATE_EXTDDL_FOR_HIVE Function

With the information from the data dictionary, you can use the CREATE_EXTDDL_FOR_HIVE function of DBMS_HADOOP. This example specifies a database table name of RATINGS_DB_TABLE in the current schema. The function returns the text of the CREATE TABLE command in a local variable named DDLout, but does not execute it.

DECLARE 
   DDLout VARCHAR2(4000);
BEGIN
   dbms_hadoop.create_extddl_for_hive(
      CLUSTER_ID=>'hadoop1',
      DB_NAME=>'default',
      HIVE_TABLE_NAME=>'ratings_hive_table',
      HIVE_PARTITION=>FALSE,
      TABLE_NAME=>'ratings_db_table',
      PERFORM_DDL=>FALSE,
      TEXT_OF_DDL=>DDLout
   );
   dbms_output.put_line(DDLout);
END;
/

When this procedure runs, the PUT_LINE function displays the CREATE TABLE command:

CREATE TABLE ratings_db_table (   
   c0 VARCHAR2(4000),
   c1 VARCHAR2(4000),
   c2 VARCHAR2(4000),
   c3 VARCHAR2(4000),
   c4 VARCHAR2(4000),
   c5 VARCHAR2(4000),
   c6 VARCHAR2(4000),
   c7 VARCHAR2(4000))
ORGANIZATION EXTERNAL
   (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR 
   ACCESS PARAMETERS
      (
       com.oracle.bigdata.cluster=hadoop1
       com.oracle.bigdata.tablename=default.ratings_hive_table
      )
   ) PARALLEL 2 REJECT LIMIT UNLIMITED

You can capture this information in a SQL script, and use the access parameters to change the Oracle table name, the column names, and the data types as desired before executing it. You might also use access parameters to specify a date format mask.

The ALL_HIVE_COLUMNS view shows how the default column names and data types are derived. This example shows that the Hive column names are C0 to C7, and that the Hive STRING data type maps to VARCHAR2(4000):

SQL> SELECT table_name, column_name, hive_column_type, oracle_column_type FROM all_hive_columns;
 
TABLE_NAME            COLUMN_NAME  HIVE_COLUMN_TYPE ORACLE_COLUMN_TYPE
--------------------- ------------ ---------------- ------------------
ratings_hive_table    c0           string           VARCHAR2(4000)
ratings_hive_table    c1           string           VARCHAR2(4000)
ratings_hive_table    c2           string           VARCHAR2(4000)
ratings_hive_table    c3           string           VARCHAR2(4000)
ratings_hive_table    c4           string           VARCHAR2(4000)
ratings_hive_table    c5           string           VARCHAR2(4000)
ratings_hive_table    c6           string           VARCHAR2(4000)
ratings_hive_table    c7           string           VARCHAR2(4000)
 
8 rows selected.

6.3.2.3 Developing a CREATE TABLE Statement for ORACLE_HIVE

You can choose between using DBMS_HADOOP and developing a CREATE TABLE statement from scratch. In either case, you may need to set some access parameters to modify the default behavior of ORACLE_HIVE.

6.3.2.3.1 Using the Default ORACLE_HIVE Settings

The following statement creates an external table named ORDER to access Hive data:

CREATE TABLE order (cust_num    VARCHAR2(10), 
                    order_num   VARCHAR2(20), 
                    description VARCHAR2(100),
                    order_total NUMBER (8,2)) 
   ORGANIZATION EXTERNAL (TYPE  oracle_hive);

Because no access parameters are set in the statement, the ORACLE_HIVE access driver uses the default settings to do the following:

  • Connects to the default Hadoop cluster.

  • Uses a Hive table named order. An error results if the Hive table does not have fields named CUST_NUM, ORDER_NUM, DESCRIPTION, and ORDER_TOTAL.

  • Sets the value of a field to NULL if there is a conversion error, such as a CUST_NUM value longer than 10 bytes.

6.3.2.3.2 Overriding the Default ORACLE_HIVE Settings

You can set properties in the ACCESS PARAMETERS clause of the external table clause to override the default behavior of the access driver. The following clause includes the com.oracle.bigdata.overflow access parameter. When this clause is used in the previous example, data longer than 100 characters in the DESCRIPTION column is truncated instead of causing an error:

(TYPE oracle_hive
 ACCESS PARAMETERS (
    com.oracle.bigdata.overflow={"action":"truncate", "col":"DESCRIPTION"} ))

The next example sets most of the available parameters for ORACLE_HIVE:

CREATE TABLE order (cust_num VARCHAR2(10), 
                    order_num VARCHAR2(20), 
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2)) ORGANIZATION EXTERNAL 
  (TYPE oracle_hive
     ACCESS PARAMETERS (
        com.oracle.bigdata.tablename:  order_db.order_summary
        com.oracle.bigdata.colmap:     {"col":"ITEM_CNT", \
                                        "field":"order_line_item_count"}
        com.oracle.bigdata.overflow:   {"action":"TRUNCATE", \
                                        "col":"DESCRIPTION"}
        com.oracle.bigdata.erroropt:   [{"action":"replace", \
                                         "value":"INVALID_NUM", \
                                         "col":["CUST_NUM","ORDER_NUM"]}, \
                                        {"action":"reject", \
                                         "col":"ORDER_TOTAL"}]
     )
  );

The parameters make the following changes in the way that the ORACLE_HIVE access driver locates the data and handles error conditions:

  • com.oracle.bigdata.tablename: Handles differences in table names. ORACLE_HIVE looks for a Hive table named ORDER_SUMMARY in the ORDER_DB database.

  • com.oracle.bigdata.colmap: Handles differences in column names. The Hive ORDER_LINE_ITEM_COUNT field maps to the Oracle ITEM_CNT column.

  • com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.

  • com.oracle.bigdata.erroropt: Replaces bad data. Errors in the data for CUST_NUM or ORDER_NUM set the value to INVALID_NUM; errors in the data for ORDER_TOTAL cause the row to be rejected.

6.3.3 Creating an External Table for HDFS Files

The ORACLE_HDFS access driver enables you to access many types of data that are stored in HDFS, but which do not have Hive metadata. You can define the record format of text data, or you can specify a SerDe for a particular data format.

You must create the external table for HDFS files manually, and provide all the information the access driver needs to locate the data, and parse the records and fields. The following are some examples of CREATE TABLE ORGANIZATION EXTERNAL statements.

6.3.3.1 Using the Default Access Parameters with ORACLE_HDFS

The following statement creates a table named ORDER to access the data in all files stored in the /usr/cust/summary directory in HDFS:

CREATE TABLE ORDER (cust_num VARCHAR2(10), 
                    order_num VARCHAR2(20), 
                    order_total NUMBER(8,2))
ORGANIZATION EXTERNAL (TYPE oracle_hdfs)
LOCATION ("hdfs:/usr/cust/summary/*");

Because no access parameters are set in the statement, the ORACLE_HDFS access driver uses the default settings to do the following:

  • Connects to the default Hadoop cluster.

  • Reads the files as delimited text, and the fields as type STRING.

  • Assumes that the number of fields in the HDFS files matches the number of columns (three in this example).

  • Assumes the fields are in the same order as the columns, so that CUST_NUM data is in the first field, ORDER_NUM data is in the second field, and ORDER_TOTAL data is in the third field.

  • Rejects any record in which a value causes a data conversion error: if the value for CUST_NUM exceeds 10 characters, the value for ORDER_NUM exceeds 20 characters, or the value of ORDER_TOTAL cannot be converted to NUMBER.

6.3.3.2 Overriding the Default ORACLE_HDFS Settings

You can use many of the same access parameters with ORACLE_HDFS as ORACLE_HIVE.

6.3.3.2.1 Accessing a Delimited Text File

The following example is equivalent to the one shown in "Overriding the Default ORACLE_HIVE Settings". The external table accesses a delimited text file stored in HDFS.

CREATE TABLE order (cust_num VARCHAR2(10), 
                    order_num VARCHAR2(20), 
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2)) ORGANIZATION EXTERNAL 
             (TYPE oracle_hdfs
 ACCESS PARAMETERS (
  com.oracle.bigdata.colmap:     {"col":"item_cnt", \
                                  "field":"order_line_item_count"}
  com.oracle.bigdata.overflow:   {"action":"TRUNCATE", \
                                  "col":"DESCRIPTION"}
  com.oracle.bigdata.erroropt:   [{"action":"replace", \
                                   "value":"INVALID NUM", \
                                   "col":["CUST_NUM","ORDER_NUM"]} , \
                                  {"action":"reject", \
                                   "col":"ORDER_TOTAL}]
 )
 LOCATION ("hdfs:/usr/cust/summary/*"));

The parameters make the following changes in the way that the ORACLE_HDFS access driver locates the data and handles error conditions:

  • com.oracle.bigdata.colmap: Handles differences in column names. ORDER_LINE_ITEM_COUNT in the HDFS files matches the ITEM_CNT column in the external table.

  • com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.

  • com.oracle.bigdata.erroropt: Replaces bad data. Errors in the data for CUST_NUM or ORDER_NUM set the value to INVALID_NUM; errors in the data for ORDER_TOTAL cause the row to be rejected.

6.3.3.2.2 Accessing Avro Container Files

The next example uses a SerDe to access Avro container files.

CREATE TABLE order (cust_num VARCHAR2(10), 
                    order_num VARCHAR2(20), 
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2)) ORGANIZATION EXTERNAL 
             (TYPE oracle_hdfs
 ACCESS PARAMETERS (
  com.oracle.bigdata.rowformat: \
     SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  com.oracle.bigdata.fileformat: \
     INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'\ 
     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  com.oracle.bigdata.colmap: {  "col":"item_cnt", \
                                "field":"order_line_item_count"}
  com.oracle.bigdata.overflow: {"action":"TRUNCATE", \
                                "col":"DESCRIPTION"}
 )
 LOCATION ("hdfs:/usr/cust/summary/*"));

The access parameters provide the following information to the ORACLE_HDFS access driver:

  • com.oracle.bigdata.rowformat: Identifies the SerDe that the access driver needs to use to parse the records and fields. The files are not in delimited text format.

  • com.oracle.bigdata.fileformat: Identifies the Java classes that can extract records and output them in the desired format.

  • com.oracle.bigdata.colmap: Handles differences in column names. ORACLE_HDFS matches ORDER_LINE_ITEM_COUNT in the HDFS files with the ITEM_CNT column in the external table.

  • com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.

6.4 About the External Table Clause

CREATE TABLE ORGANIZATION EXTERNAL takes the external_table_clause as its argument. It has the following subclauses: TYPE, DEFAULT DIRECTORY, LOCATION, REJECT LIMIT, and ACCESS PARAMETERS. Each is described in the topics that follow.

See Also:

Oracle Database SQL Language Reference for the external_table_clause

6.4.1 TYPE Clause

The TYPE clause identifies the access driver. The type of access driver determines how the other parts of the external table definition are interpreted.

Specify one of the following values for Oracle Big Data SQL:

  • ORACLE_HDFS: Accesses files in an HDFS directory.

  • ORACLE_HIVE: Accesses a Hive table.

Note:

The ORACLE_DATAPUMP and ORACLE_LOADER access drivers are not associated with Oracle Big Data SQL.

6.4.2 DEFAULT DIRECTORY Clause

The DEFAULT DIRECTORY clause identifies an Oracle Database directory object. The directory object identifies an operating system directory with files that the external table reads and writes.

ORACLE_HDFS and ORACLE_HIVE use the default directory solely to write log files on the Oracle Database system.
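
For example, a directory object such as the DEFAULT_DIR object that appears in the generated DDL earlier in this chapter might be created as follows. This is a sketch only; the operating system path and the grantee are illustrative.

CREATE OR REPLACE DIRECTORY default_dir AS '/u01/app/oracle/bigdata_logs';
GRANT READ, WRITE ON DIRECTORY default_dir TO hdfs_query_user;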

6.4.3 LOCATION Clause

The LOCATION clause identifies the data source.

6.4.3.1 ORACLE_HDFS LOCATION Clause

The LOCATION clause for ORACLE_HDFS contains a comma-separated list of file locations. The files must reside in the HDFS file system on the default cluster.

A location can be any of the following:

  • A fully qualified HDFS directory name, such as /user/hive/warehouse/hive_seed/hive_types. ORACLE_HDFS uses all files in the directory.

  • A fully qualified HDFS file name, such as /user/hive/warehouse/hive_seed/hive_types/hive_types.csv

  • A URL for an HDFS file or a set of files, such as hdfs:/user/hive/warehouse/hive_seed/hive_types/*. Just a directory name is invalid.

The file names can contain any pattern-matching character described in Table 6-1.

Table 6-1 Pattern-Matching Characters

Character        Description
---------------  -------------------------------------------------------------------
?                Matches any one character.
*                Matches zero or more characters.
[abc]            Matches one character in the set {a, b, c}.
[a-b]            Matches one character in the range {a...b}. The character a must
                 be less than or equal to b.
[^a]             Matches one character that is not in the character set or range
                 {a}. The caret (^) must immediately follow the left bracket, with
                 no spaces.
\c               Removes any special meaning of c. The backslash is the escape
                 character.
{ab\,cd}         Matches a string from the set {ab, cd}. The escape character (\)
                 removes the meaning of the comma as a path separator.
{ab\,c{de\,fh}}  Matches a string from the set {ab, cde, cfh}. The escape
                 character (\) removes the meaning of the comma as a path
                 separator.
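
For example, the following LOCATION clause (a sketch with illustrative paths) combines several of these patterns to select the monthly summary files for 2014 and all detail files:

LOCATION ("hdfs:/user/cust/summary/sum_2014_[0-9][0-9].csv",
          "/user/cust/summary/detail_*.csv")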


6.4.3.2 ORACLE_HIVE LOCATION Clause

Do not specify the LOCATION clause for ORACLE_HIVE; it raises an error. The data is stored in Hive, and the access parameters and the metadata store provide the necessary information.

6.4.4 REJECT LIMIT Clause

Limits the number of conversion errors permitted during a query of the external table before Oracle Database stops the query and returns an error.

Any processing error that causes a row to be rejected counts against the limit. The reject limit applies individually to each parallel query (PQ) process. It is not the total of all rejected rows for all PQ processes.
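
For example, the following clause (a sketch with an illustrative path) allows up to 100 rejected rows for each PQ process before the query fails, instead of the REJECT LIMIT UNLIMITED setting shown in the generated DDL earlier in this chapter:

ORGANIZATION EXTERNAL
   (TYPE ORACLE_HDFS
    LOCATION ("hdfs:/user/cust/summary/*"))
REJECT LIMIT 100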

6.4.5 ACCESS PARAMETERS Clause

The ACCESS PARAMETERS clause provides information that the access driver needs to load the data correctly into the external table. See "CREATE TABLE ACCESS PARAMETERS Clause".

6.5 About Data Type Conversions

When the access driver loads data into an external table, it verifies that the Hive data can be converted to the data type of the target column. If they are incompatible, then the access driver returns an error. Otherwise, it makes the appropriate data conversion.

Hive typically provides a table abstraction layer over data stored elsewhere, such as in HDFS files. Hive uses a serializer/deserializer (SerDe) to convert the data as needed from its stored format into a Hive data type. The access driver then converts the data from its Hive data type to an Oracle data type. For example, if a Hive table over a text file has a BIGINT column, then the SerDe converts the data from text to BIGINT. The access driver then converts the data from BIGINT (a Hive data type) to NUMBER (an Oracle data type).

Performance is better when one data type conversion is performed instead of two. The data types for the fields in the HDFS files should therefore indicate the data that is actually stored on disk. For example, JSON is a clear text format, so all data in a JSON file is text. If the Hive type for a field is DATE, then the SerDe converts the data from string (in the data file) to a Hive date, and the access driver then converts the data from a Hive date to an Oracle date. However, if the Hive type for the field is string, then the SerDe does not perform a conversion, and the access driver converts the data from string to an Oracle date. Queries against the external table are faster in the second case, because the access driver performs the only data conversion.

Table 6-2 identifies the data type conversions that ORACLE_HIVE can make when loading data into an external table.

Table 6-2 Supported Hive to Oracle Data Type Conversions

Hive Data Type     VARCHAR2,     NUMBER,          BLOB     RAW    DATE, TIMESTAMP,  INTERVAL YEAR
                   NVARCHAR2,    FLOAT,                           TIMESTAMP WITH    TO MONTH,
                   CHAR, NCHAR,  BINARY_NUMBER,                   TZ, TIMESTAMP     INTERVAL DAY
                   CLOB          BINARY_FLOAT                     WITH LOCAL TZ     TO SECOND
-----------------  ------------  ---------------  -------  -----  ----------------  -------------
INT, SMALLINT,     yes           yes              yes      yes    no                no
TINYINT, BIGINT
DOUBLE, FLOAT      yes           yes              yes      yes    no                no
DECIMAL            yes           yes              no       no     no                no
BOOLEAN            yes (1)       yes (2)          yes (2)  yes    no                no
BINARY             yes           no               yes      yes    no                no
STRING             yes           yes              yes      yes    yes               yes
TIMESTAMP          yes           no               no       no     yes               no
STRUCT, ARRAY,     yes           no               no       no     no                no
UNIONTYPE, MAP

Footnote 1: FALSE maps to the string FALSE, and TRUE maps to the string TRUE.

Footnote 2: FALSE maps to 0, and TRUE maps to 1.

6.6 Querying External Tables

Users can query external tables using the SQL SELECT statement, the same as they query any other table.
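
For example, the following query is a sketch that assumes the RATINGS_DB_TABLE external table created earlier in this chapter. It runs like a query against any other table; the access driver reads and parses the Hive data when the query executes.

SELECT c0, c1, c4
  FROM ratings_db_table
 WHERE c4 IS NOT NULL;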

6.6.1 Granting User Access

Users who query the data on a Hadoop cluster must have READ access in Oracle Database to the external table and to the database directory object that points to the cluster directory. See "About the Cluster Directory".
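
For example, the following statements grant the required access. This is a sketch: ANALYST is an illustrative user, and HADOOP1_DIR is an illustrative directory object that points to the cluster directory.

GRANT READ ON ratings_db_table TO analyst;
GRANT READ ON DIRECTORY hadoop1_dir TO analyst;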

6.6.2 About Error Handling

By default, a query returns no data if an error occurs while the value of a column is calculated. Processing continues after most errors, particularly those thrown while the column values are calculated.

Use the com.oracle.bigdata.erroropt parameter to determine how errors are handled.

6.6.3 About the Log Files

You can use access parameters to customize the log files.

6.7 About Oracle Big Data SQL on Oracle Exadata Database Machine

Oracle Big Data SQL runs exclusively on systems with Oracle Big Data Appliance connected to Oracle Exadata Database Machine. The Oracle Exadata Storage Server Software is deployed on a configurable number of Oracle Big Data Appliance servers. These servers combine the functionality of a CDH node and an Oracle Exadata Storage Server.

The Mammoth utility installs the Big Data SQL software on both Oracle Big Data Appliance and Oracle Exadata Database Machine. The information in this section explains the changes that Mammoth makes to the Oracle Database system.

This section contains the following topics:

Note:

Oracle SQL Connector for HDFS provides access to Hadoop data for all Oracle Big Data Appliance racks, including those that are not connected to Oracle Exadata Database Machine. However, it does not offer the performance benefits of Oracle Big Data SQL, and it is not included under the Oracle Big Data Appliance license. See Oracle Big Data Connectors User's Guide.

6.7.1 Starting and Stopping the Big Data SQL Agent

The agtctl utility starts and stops the multithreaded Big Data SQL agent. It has the following syntax:

agtctl {startup | shutdown} bds_clustername

6.7.2 About the Common Directory

The common directory contains configuration information that is common to all Hadoop clusters. This directory is located on the Oracle Database system under the Oracle home directory. The oracle file system user (or whichever user owns the Oracle Database instance) owns the common directory. A database directory named ORACLE_BIGDATA_CONFIG points to the common directory.
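
For example, you can confirm the operating system location of the common directory by querying the ALL_DIRECTORIES view:

SELECT directory_path
  FROM all_directories
 WHERE directory_name = 'ORACLE_BIGDATA_CONFIG';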

6.7.3 Common Configuration Properties

The Mammoth installation process creates two configuration files, bigdata.properties and bigdata-log4j.properties, and stores them in the common directory. They are described in the following topics.

The Oracle DBA can edit these configuration files as necessary.

6.7.3.1 bigdata.properties

The bigdata.properties file in the common directory contains property-value pairs that define the Java class paths and native library paths required for accessing data in HDFS.

The following properties are required: bigdata.cluster.default, java.classpath.hadoop, java.classpath.hive, java.classpath.oracle, and java.libjvm.file.

The following list describes all properties permitted in bigdata.properties.


bigdata.cluster.default

The name of the default Hadoop cluster. The access driver uses this name when the access parameters do not specify a cluster. Required.

Changing the default cluster name might break external tables that were created previously without an explicit cluster name.

bigdata.cluster.list

A comma-separated list of Hadoop cluster names. Optional.

java.classpath.hadoop

The Hadoop class path. Required.

java.classpath.hive

The Hive class path. Required.

java.classpath.oracle

The path to the Oracle JXAD Java JAR file. Required.

java.classpath.user

The path to user JAR files. Optional.

java.libjvm.file

The full file path to the JVM shared library (such as libjvm.so). Required.

java.options

A comma-separated list of options to pass to the JVM. Optional.

This example sets the maximum heap size to 2 GB, and verbose logging for Java Native Interface (JNI) calls:

-Xmx2048m,-verbose:jni

LD_LIBRARY_PATH

A colon separated (:) list of directory paths to search for the Hadoop native libraries. Recommended.

If you set this option, then do not set java.library.path in java.options.

Example 6-1 shows a sample bigdata.properties file.

Example 6-1 Sample bigdata.properties File

# bigdata.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
#    NAME
#      bigdata.properties - Big Data Properties File
#
#    DESCRIPTION
#      Properties file containing parameters for allowing access to Big Data
#      Fixed value properties can be added here 
#
 
java.libjvm.file=$ORACLE_HOME/jdk/jre/lib/amd64/server/libjvm.so
java.classpath.oracle=$ORACLE_HOME/hadoopcore/jlib/*:$ORACLE_HOME/hadoop/jlib/hver-2/*:$ORACLE_HOME/dbjava/lib/*
java.classpath.hadoop=$HADOOP_HOME/*:$HADOOP_HOME/lib/*
java.classpath.hive=$HIVE_HOME/lib/*
LD_LIBRARY_PATH=$ORACLE_HOME/jdk/jre/lib
bigdata.cluster.default=hadoop_cl_1

6.7.3.2 bigdata-log4j.properties

The bigdata-log4j.properties file in the common directory defines the logging behavior of queries against external tables in the Java code. Any log4j properties are allowed in this file.

Example 6-2 shows a sample bigdata-log4j.properties file with the relevant log4j properties.

Example 6-2 Sample bigdata-log4j.properties File

# bigdata-log4j.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
#    NAME
#      bigdata-log4j.properties - Big Data Logging Properties File
#
#    DESCRIPTION
#      Properties file containing logging parameters for Big Data
#      Fixed value properties can be added here
 
bigsql.rootlogger=INFO,console
log4j.rootlogger=DEBUG, file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.logger.oracle.hadoop.sql=ALL, file
 
bigsql.log.dir=.
bigsql.log.file=bigsql.log
log4j.appender.file.File=$ORACLE_HOME/bigdatalogs/bigdata-log4j.log

See Also:

Apache Logging Services documentation at

http://logging.apache.org/log4j/1.2/manual.html

6.7.4 About the Cluster Directory

The cluster directory contains configuration information for a CDH cluster. Each cluster that Oracle Database will access using Oracle Big Data SQL has a cluster directory. This directory is located on the Oracle Database system under the common directory. For example, a cluster named bda1_cl_1 would have a directory by the same name (bda1_cl_1) in the common directory.

The cluster directory contains the CDH client configuration files for accessing the cluster, such as the following:

  • core-site.xml

  • hdfs-site.xml

  • hive-site.xml

  • mapred-site.xml (optional)

  • log4j property files (such as hive-log4j.properties)

A database directory object points to the cluster directory. Users who want to access the data in a cluster must have read access to the directory object.

See Also:

"Providing Remote Client Access to CDH" for a more detailed discussion of Hadoop clients.

6.7.5 About Permissions

The oracle operating system user (or whatever user owns the Oracle Database installation directory) must have the following setup:

  • READ/WRITE access to the database directory object that points to the log directory. These permissions enable the access driver to create the log files and enable the user to read them.

  • A corresponding oracle operating system user defined on Oracle Big Data Appliance, with READ access in the operating system to the HDFS directory where the source data is stored.