Preface

The Oracle Big Data SQL User's Guide describes how to use and manage the Oracle Big Data SQL product.

Audience

This guide is intended for administrators and users of Oracle Big Data SQL, including:

  • Application developers

  • Data analysts

  • Data scientists

  • Database administrators

  • System administrators

The guide assumes that the reader has basic knowledge of Oracle Database single-node and multinode systems, the Hadoop framework, the Linux operating system, and networking concepts.

Related Documents

The following resources provide more information about Oracle Big Data SQL and related products:

  • Documentation for Oracle Big Data Appliance and Oracle Big Data Connectors, as well as Oracle's other Big Data solutions and Oracle Database, is available at the Oracle Help Center.

  • For more information on Hortonworks HDP and Ambari, see the Hortonworks documentation site at http://docs.hortonworks.com/index.html.

  • For more information on Cloudera CDH and Cloudera Manager, see http://www.cloudera.com/documentation.html.

Conventions

The following text conventions are used in this document:

  • boldface: Boldface type indicates graphical user interface elements associated with an action, or terms defined in text or the glossary.

  • italic: Italic type indicates book titles, emphasis, or placeholder variables for which you supply particular values.

  • monospace: Monospace type indicates commands within a paragraph, URLs, code in examples, text that appears on the screen, or text that you enter.

  • # prompt: The pound (#) prompt indicates a command that is run as the Linux root user.

Backus-Naur Form Syntax

The syntax in this reference is presented in a simple variation of Backus-Naur Form (BNF) that uses the following symbols and conventions:

  • [ ]: Brackets enclose optional items.

  • { }: Braces enclose a choice of items, only one of which is required.

  • |: A vertical bar separates alternatives within brackets or braces.

  • ...: Ellipses indicate that the preceding syntactic element can be repeated.

  • delimiters: Delimiters other than brackets, braces, and vertical bars must be entered as shown.

  • boldface: Words appearing in boldface are keywords. They must be typed as shown. (Keywords are case-sensitive in some, but not all, operating systems.) Words that are not in boldface are placeholders for which you must substitute a name or value.

Changes in Oracle Big Data SQL 3.2.1

Oracle Big Data SQL Release 3.2.1 includes several features that are not available in Release 3.2; there are no other enhancements or defect fixes in Release 3.2.1. Features introduced in Release 3.2 are also included in this summary.

Support for Oracle Database 12.2.0.1

Release 3.2.1 can be installed on both Oracle Database 12.2.0.1 and 12.1.0.2. The same Oracle Big Data SQL 3.2.1 installation on a Hadoop cluster can connect to both Oracle Database 12.2.0.1 and 12.1.0.2.

Customers using Oracle Big Data SQL 3.2 do not need to upgrade to Release 3.2.1 if they connect to Oracle Database 12.1.0.2 only.

Note:

Patches must be applied on Oracle Database 12.x and Oracle Grid Infrastructure (when used). The applicable patches depend on the database version in use. See Oracle Support Document 2119369.1 (Oracle Big Data SQL Master Compatibility Matrix) for details.

Exclusion of Edge Nodes

The installation now automatically excludes Hadoop edge nodes. In earlier releases, the BDS agent was installed on edge nodes to enable Copy to Hadoop. The BDS agent checks for installation prerequisites, and these checks would sometimes fail on edge nodes. The BDS agent is no longer installed by default on edge nodes.

In addition, the Oracle Big Data SQL configuration file now has an optional excluded_nodes parameter for excluding nodes manually. It is possible, though unlikely, that the installer will fail to identify an edge node. You can use this parameter to identify any missed edge nodes and then re-run the installer, but you should rarely need to do so.

This parameter is in the network section of the configuration file.

   "excluded_nodes" : [ "<fqdn_for_edge_node_1>", "<fqdn_for_edge_node_2>", "<fqdn_for_edge_node_3>" ]

Known Limitations

JSON CLOB Predicate Pushdown not yet Available in Release 3.2.1

Oracle Big Data SQL 3.2 includes a performance optimization that enables Oracle Big Data SQL to filter JSON data locally on its processing cells in Hadoop for CLOB columns up to 1 MB. This optimization is not currently included in Release 3.2.1 and will be added in a later DBBP (Database Bundle Patch).

Storage Indexes not Utilized Under Some Conditions

Storage Indexes are a query processing optimization that avoids unnecessary data scans. In Release 3.2.1, Storage Indexes for an Oracle Big Data SQL table may not be utilized after the following DDL operation is performed on the table:
ALTER TABLE table_name PROJECT COLUMN REFERENCED; 

To correct this problem, recreate the external table.
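
For example, a hedged sketch of re-creating an ORACLE_HIVE external table (the table name, column, and Hive source here are illustrative, not taken from this guide):

DROP TABLE movielog;
CREATE TABLE movielog (click VARCHAR2(4000))
   ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY DEFAULT_DIR
      ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.movielog)
   )
   REJECT LIMIT UNLIMITED;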

Changes Inherited From Oracle Big Data SQL 3.2

Oracle Big Data SQL Release 3.2 includes major improvements in performance, secure network connectivity, authentication, and user administration, as well as installation and configuration.

Support for Querying Kafka Topics

Release 3.2 gives Hive and Oracle Big Data SQL the ability to query Kafka topics via a new Hive storage handler. You can use this storage handler to create external Hive tables backed by data residing in Kafka. Oracle Big Data SQL or Hive can then query the Kafka data through the external tables. The Kafka key, value, offset, topic name, and partition ID are mapped to Hive columns. You can explicitly designate the offset for each topic/partition pair; otherwise, the offsets start from the earliest offset in the topic and end with the latest offset in the topic for each partition.
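
For illustration, a hedged sketch of the Hive DDL involved (the storage handler class and table property names here are assumptions for illustration only; see the product documentation for the exact names):

CREATE EXTERNAL TABLE clicks_kafka
STORED BY 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
TBLPROPERTIES (
   'oracle.kafka.table.kafka.topics' = 'clicks',
   'oracle.kafka.bootstrap.servers'  = 'broker1:9092'
);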

Improved Processing of Parquet Files

Oracle has introduced its own Parquet reader for processing data in Parquet format. This new reader provides significant performance and resource utilization improvements over the existing Hive Parquet driver. These include:

  • More intelligent column retrieval. The reader uses “lazy materialization” to materialize only the column values for rows that satisfy the filter, thereby reducing I/O.

  • Leveraging of dictionaries during filter predicate processing to improve CPU usage.

  • Streamlined data conversion, which also contributes to more efficient CPU usage.

The Big Data SQL installation enables Oracle's Parquet reader by default. You have the option to disable it and revert to the generic Parquet reader.

Multi-User Authorization

In previous releases of Oracle Big Data SQL, all queries against Hadoop and Hive data were executed as the oracle user, and there was no option to change users. Although oracle is still the underlying user in all cases, Oracle Big Data SQL 3.2 now uses Hadoop Secure Impersonation to direct the oracle account to execute tasks on behalf of other designated users. This enables HDFS data access based on the user that is currently executing the query, rather than the single oracle user.

Administrators set up the rules for identifying the query user. They can provide rules for identifying the currently connected user and for mapping the connected user to the user to be impersonated. Because there are numerous ways in which users can connect to Oracle Database, this user may be a database user, a user sourced from LDAP or Kerberos, or a user from another source. Authorization rules on the files then apply to that user, and HDFS auditing identifies the actual user running the query.

See Also:

Administration for Multi-User Authorization is done through the DBMS_BDSQL PL/SQL Package, which is documented in the Oracle Big Data SQL User’s Guide.
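
A hedged sketch of mapping a connected database user to a Hadoop user with this package (the ADD_USER_MAP procedure and parameter names shown are an assumption for illustration; consult the DBMS_BDSQL reference in this guide for the exact signature, and note that the user names are illustrative):

BEGIN
   DBMS_BDSQL.ADD_USER_MAP(
      current_database_user       => 'ANALYST1',
      syscontext_parm_hadoop_user => 'analyst1'
   );
END;
/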

Authentication Between Oracle Database and Oracle Big Data SQL Cells

Oracle Big Data SQL now supports authentication between Oracle Database and the Big Data SQL cells on the Hadoop cluster, facilitating secure communication. This Database Authentication enhancement provides a safeguard against impersonation attacks, in which a rogue service attempts to connect to the Oracle Big Data offload server process running on a cluster node.

Kerberos Ticket Renewal Automation

On a Kerberos-secured network, you can configure the installation to set up automated Kerberos ticket renewal for the oracle account used by Oracle Big Data SQL. This is done for both the Hadoop cluster and Oracle Database sides of the installation. You must provide the principal name and the path to the keytab file in the bds-config.json configuration file. A template is provided in the configuration file:

"kerberos" : {
"principal" : "oracle/mycluster@MY.DOMAIN.COM",
"keytab" : "/home/oracle/security/oracle.keytab"
}

If you provide the Kerberos parameters in the configuration file, then the Oracle Big Data SQL installation sets up cron jobs on both the Hadoop cluster and Oracle Database servers. These jobs renew the Kerberos tickets for the principal once per day.

The principal and keytab file must already exist.
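
The renewal that these cron jobs perform is equivalent in effect to a standard keytab-based kinit; a minimal sketch, assuming the template values above:

# kinit -k -t /home/oracle/security/oracle.keytab oracle/mycluster@MY.DOMAIN.COM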

Automatic Upgrade

The current release can now be installed over an earlier release with no need to remove the older software on either the Hadoop or Oracle Database side. The previous installation is upgraded to the current release level.

Common Installation Bundle for all Platforms

In previous releases, customers needed to unpack the Oracle Big Data SQL installation bundle and choose the correct package for their Hadoop system (CDH or HDP). Now the bundle contains a single installation package that works for all supported Hadoop systems.

Simpler and Faster Installation with the new “Jaguar” Installer

The Jaguar installer replaces setup-bds.sh, the installer used in previous releases. Jaguar includes these changes:

  • Automatic Check for Installation Prerequisites on Hadoop Nodes

    Jaguar checks for installation readiness on each Hadoop DataNode and reports any missing prerequisites.

  • No Need to Manually Generate the Database-Side Installation Bundle

    The database-side installation bundle, which previously had to be generated manually by the customer, is now generated automatically. You still need to copy the bundle to the Oracle Database nodes and install it, as sketched below.
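
    A hedged sketch of those database-side steps (the bundle name here is a placeholder, and the bds-database-install.sh script name is an assumption about the bundle contents):
     # unzip <database_side_bundle>.zip
     # ./bds-database-install.sh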

  • Faster Overall Installation Time on the Hadoop Side

    Installation time varies, but on the Hadoop side the installation may take approximately eight minutes if all resources are local, or up to 20 minutes if Hadoop clients must be downloaded from the Internet, depending on download speed.

  • Prerequisite Apache Services on CDH can now be Installed as Either Packages or Parcels

    Previously on CDH systems, the Oracle Big Data SQL installation required that the HDFS, YARN, and Hive components had been installed as parcels. These components can now be installed on CDH as either packages or parcels. There is no change for HDP, where they must be installed as stacks.

    Note:

    On CDH systems, if the Hadoop services required by Oracle Big Data SQL are installed as packages, be sure that they are installed from within Cloudera Manager. Otherwise, Cloudera Manager will not be able to manage these services. This is not an issue with parcels.
  • In the CLI, the Jaguar Utility Replaces ./setup-bds.sh

    The Jaguar utility is now the primary tool for Hadoop-side installation, de-installation, and configuration changes, as in these examples:
    # ./jaguar install bds-config.json
    # ./jaguar reconfigure bds-config.json
    # ./jaguar uninstall bds-config.json 
  • The Default Configuration File Name is bds-config.json, but Alternate File Names are Also Accepted

    You can now drop the explicit bds-config.json argument and allow the installer to default to bds-config.json, as in the first example below. You can also specify an alternate configuration file of any name, though it must adhere to the same internal format as bds-config.json and should be given the .json file type.
    # ./jaguar install 
    # ./jaguar install cluster2-config.json
    You can create configuration files with settings tailored to the requirements of each cluster. For example, you may want to apply different security parameters to Oracle Big Data SQL installations on test and production clusters.
  • Configuration Parameters Have Changed Significantly

    Users of previous releases will see that the Jaguar configuration file includes a number of new parameters. Most of them are “optional” in the sense that they are not uniformly required, although your particular installation may require some of them. See the Related Links section below for links to the table of installer parameters as well as an example of a configuration file that uses all available parameters.

  • New updatenodes Command for Easier Maintenance

    Oracle Big Data SQL must be installed on each Hadoop cluster node that is provisioned with the DataNode role. It has no function on nodes where the DataNode role is not present. The new Jaguar utility includes the updatenodes command, which scans the cluster for instances of the DataNode role. If the DataNode role has been removed or relocated, or if nodes provisioned with it have been added or removed, then the command installs or uninstalls Oracle Big Data SQL components on nodes as needed.
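
    For example, following the command pattern shown above:
     # ./jaguar updatenodes bds-config.json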

  • An Extra Installation Step is Required to Enable Some Security Features

    If you choose to enable Database Authentication between Oracle Database and Oracle Big Data SQL cells in the Hadoop cluster, or Hadoop Secure Impersonation, then an additional “Database Acknowledge” step is required. In this process, the installation on the database server generates a ZIP file of configuration information that you must copy back to the Hadoop cluster management server for processing.
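
    A hedged sketch of the acknowledge step on the cluster management server (the databaseack command name is an assumption following the Jaguar command pattern above):
     # ./jaguar databaseack bds-config.json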

  • On the Database Side, Connections to Clusters are no Longer Classified as Primary and Secondary

    An Oracle Database system can have Oracle Big Data SQL connections to multiple Hadoop clusters. In previous releases, the first of these connections was considered the primary connection (and had to be uninstalled last), and the others were secondary. In the current release, management of multiple installations is simpler, and the --uninstall-as-primary and --uninstall-as-secondary parameters of the database-side installer are obsolete. However, there is now a default cluster. The Important Terms and Concepts section of this guide explains the significance of the default cluster.

Support for Oracle Tablespaces in HDFS Extended to Include All Non-System Permanent Tablespaces

Previous releases supported moving only online permanent tablespaces to HDFS. This functionality now supports online, read-only, and offline permanent tablespaces.
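
Tablespaces are moved to HDFS with a database-side utility script; a hedged example (the bds-copy-tbs-to-hdfs.sh script name and its --tablespace option are assumptions here, and the tablespace name is illustrative):

# ./bds-copy-tbs-to-hdfs.sh --tablespace=movie_tbs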

Important Change in Behavior of the “mtactl start” Command

Oracle Big Data SQL 3.1 introduced the option to install Oracle Big Data SQL on servers where Oracle Grid Infrastructure is not present. In these environments, you can use the start subcommand of the mtactl utility (mtactl start) to start the MTA (Multi-Threaded Agent) extproc.

Note that in the current release, the mtactl start command works differently from the original Release 3.1 implementation.

  • Current behavior: mtactl start starts an MTA extproc using the init parameter values that are stored in the repository. It uses the default values only if the repository does not exist.

  • Previous behavior (Oracle Big Data SQL 3.1): mtactl start always uses the default init parameters regardless of whether or not init parameter values are stored in the repository.

Resource Requirements

8 CPU cores and 12 GB of RAM are now recommended for each node of the Hadoop cluster.

There are some sysctl settings related to kernel, swap, core memory, and socket buffer size that are strongly recommended for optimal performance. These are part of the installation prerequisites explained in Chapter 2 of the installation guide.
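
For illustration only, prerequisites of this kind are typically applied with sysctl. The parameter names below match the categories mentioned (kernel memory, swap, and socket buffer size), but the values are placeholders; use the values given in the installation guide:

# sysctl -w kernel.shmmax=68719476736
# sysctl -w vm.swappiness=1
# sysctl -w net.core.rmem_max=4194304
# sysctl -w net.core.wmem_max=4194304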