2 Oracle Shell for Hadoop Loaders

Oracle Big Data Connectors are a powerful toolset for data interchange between Hadoop and Oracle Database. Oracle Shell for Hadoop Loaders (OHSH) is an easy-to-use command-line tool for working with Oracle Loader for Hadoop, Oracle SQL Connector for HDFS, and the Copy to Hadoop feature of Big Data SQL.

What is Oracle Shell for Hadoop Loaders?

Oracle Shell for Hadoop Loaders is a helper shell that provides an easy-to-use command-line interface to Oracle Loader for Hadoop, Oracle SQL Connector for HDFS, and the Copy to Hadoop feature of Big Data SQL. It has basic shell features such as command-line recall, history, inheriting environment variables from the parent process, setting new or existing environment variables, and performing environment variable substitution in the command line.

The core functionality of Oracle Shell for Hadoop Loaders includes the following:

  • Defining named external resources with which Oracle Shell for Hadoop Loaders interacts to perform loading tasks.

  • Setting default values for load operations.

  • Running load commands.

  • Delegating simple pre- and post-load tasks to the operating system, HDFS, Hive, and Oracle Database. These tasks include viewing the data to be loaded and viewing the data in the target table after loading.

See Also:

The examples directory in the OHSH kit contains many examples that define resources and load data using Oracle Shell for Hadoop Loaders. Unzip <OHSH_KIT>/examples.zip and see <OHSH_KIT>/examples/README.txt for a description of the examples and instructions on how to run OHSH load methods.

Note:

The HADOOP_CLASSPATH environment variable should be cleared before invoking Oracle Shell for Hadoop Loaders.
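For example, in a bash session you can clear the variable just before starting the shell:

  $ unset HADOOP_CLASSPATH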

Configure Oracle Shell for Hadoop Loaders (OHSH)

Note:

For instructions on configuring OHSH to run on Oracle Big Data Service, see Use Big Data Connectors and Copy to Hadoop to Copy Data Between Big Data Service and a Database Instance in Using Oracle Big Data Service.

For OHSH on a Hadoop cluster or edge node, or on the Oracle Database System, confirm the location of the OHSH install and set OHSH_HOME. For example:

  1. Set OHSH_HOME.
    $ export OHSH_HOME=/opt/oracle/ohsh-5.0.0 
  2. Edit $OHSH_HOME/bin/ohsh_config.sh to set the following environment variables (a consolidated example follows these steps):
    1. OLH_HOME: Installation location of Oracle Loader for Hadoop. For example:
      export OLH_HOME=/opt/oracle/oraloader-5.0.0-h2
    2. HS2_HOST_PORT: Hiveserver2 server and port information. For example:
      export HS2_HOST_PORT=<server>:<port>
    3. HADOOP_CONF_DIR: Hadoop conf directory. For example:
      export HADOOP_CONF_DIR=/etc/hadoop/conf 
    4. HIVE_CONF_DIR: Hive conf directory. For example:
      export HIVE_CONF_DIR=/etc/hive/conf 
    5. TNS_ADMIN: To identify the Oracle Database.
      • If you are using a standard JDBC connection instead of SSL or Oracle Wallet, then you need to copy tnsnames.ora and sqlnet.ora from $TNS_ADMIN on the database host to a <directory_location> on the system where you plan to run OHSH.
        export TNS_ADMIN=<directory_location>
      • If you are using Oracle Wallet to store database credentials, then create the wallet file on the database host and copy it to a <wallet_location> on the system where you plan to run OHSH. In addition to TNS_ADMIN, set WALLET_LOCATION to the directory containing the wallet file.
        export WALLET_LOCATION=<wallet_location>

        Note:

        Make sure that sqlnet.ora has the wallet enabled and points to the correct wallet location. For example:
        WALLET_LOCATION=
          (SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=<wallet_location>)))
        sqlnet.wallet_override=true
      • If you are using SSL, TNS_ADMIN is the directory where the SSL client wallet files reside. For example:
        export TNS_ADMIN=/home/oracle/SSL_wallet
    6. SQLCLIENT_HOME: Database client installation directory. For example:
      export SQLCLIENT_HOME=/usr/lib/oracle/12.1/client64 
    7. Set HIVE_SESS_VAR_LIST for any required Hive session variables. For example:
      • To connect to HiveServer2 with Kerberos authentication:
        export HIVE_SESS_VAR_LIST="principal=<The server principal of HiveServer2>"

        The Hive principal is specified by the configuration property hive.server2.authentication.kerberos.principal in hive-site.xml.

      • To connect to HiveServer2 running in HTTP mode:
        export HIVE_SESS_VAR_LIST="transportMode=http;
        httpPath=<The HTTP endpoint>"
      • To connect to HiveServer2 when SSL is enabled:
        export HIVE_SESS_VAR_LIST="ssl=true;
        sslTrustStore=<Trust Store path>;
        trustStorePassword=<Trust Store password>"
  3. Add $OHSH_HOME/bin to your PATH variable.
    $ export PATH=$PATH:$OHSH_HOME/bin
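
Putting the pieces together, a minimal ohsh_config.sh for a basic (non-Kerberos, non-SSL) setup might look like the following sketch. All paths and the HiveServer2 host are illustrative assumptions:

  # Illustrative values only -- substitute your own installation paths
  export OLH_HOME=/opt/oracle/oraloader-5.0.0-h2
  export HS2_HOST_PORT=hiveserver2.example.com:10000
  export HADOOP_CONF_DIR=/etc/hadoop/conf
  export HIVE_CONF_DIR=/etc/hive/conf
  export TNS_ADMIN=/home/oracle/tns_admin
  export SQLCLIENT_HOME=/usr/lib/oracle/12.1/client64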

Get Started with Oracle Shell for Hadoop Loaders

Starting an OHSH Interactive Session

To start an interactive session, enter ohsh on the command line. (Remember to clear HADOOP_CLASSPATH before invoking OHSH.) If ohsh is in your PATH, this brings you to the OHSH shell:

$ ohsh
ohsh>

You can execute OHSH commands in this shell (using the OHSH syntax). You can also execute commands for Beeline/Hive, Hadoop, bash, and SQL*Plus. For non-OHSH commands, prefix the name of the resource that executes the command with the delegation operator ("%"). For example:

ohsh> %bash0 ls -l

Scripting OHSH

You can also script the same commands that work in the CLI. The ohsh command provides several parameters for working with scripts.

  • ohsh -i <filename>.ohsh

    The -i parameter tells OHSH to initialize an interactive session with the commands in the script before the prompt appears. This is a useful way to set up the required session resources and automate other preliminary tasks before you start working within the shell (see the example script after this list).

    $ ohsh -i initresources.ohsh
  • ohsh -f <filename>.ohsh

    The ohsh command with the -f parameter starts a non-interactive session and runs the commands in the script.

    $ ohsh -f myunattendedjobs.ohsh
  • ohsh -i -f <filename>.ohsh

    You can use -i and -f together to initialize a non-interactive session and then run another script in the session.

    $ ohsh -i mysetup.ohsh -f myunattendedjobs.ohsh
  • ohsh -c

    This command dumps all Hadoop configuration properties that an OHSH session inherits at startup.
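
As an illustration, an initresources.ohsh script might contain nothing more than the resource definitions the session needs. The resource names and connect identifier below are hypothetical:

  create sqlplus resource sql0 connectid="myhost.example.com:1521/orcl"
  create jdbc resource jdbc0 connectid="myhost.example.com:1521/orcl"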

Working With OHSH Resources

A resource is a named entity that OHSH interacts with. For example, a Hadoop cluster is a resource, as is a JDBC connection to an Oracle database, a Hive database, a SQL session with an Oracle database, and a bash shell on the local OS.

OHSH provides two default resources at startup: hive0 (to connect to the default Hive database) and hadoop0 (to connect to the Hadoop cluster).

  • Using hive0 resource to execute a Hive command:

    ohsh> %hive0 show tables

    You can create additional Hive resources to connect to other Hive databases.

  • Using the hadoop0 resource to execute a Hadoop command:

    ohsh> %hadoop0 fs -ls

Within an interactive or scripted session, you can create instances of additional resources, such as SQL*Plus, JDBC, and SSL-enabled variants of these. You need to create SQL*Plus and JDBC resources in order to connect to Oracle Database through OHSH.

  • Creating a SQL resource:
    ohsh> create sqlplus resource sql0 connectid="bigdatalite.localdomain:1521/orcl"

    Note:

    At the prompt, enter database username and password.
  • Creating a JDBC resource:
    ohsh> create jdbc resource jdbc0 connectid=<database connection URL>

    Note:

    At the prompt, enter database username and password.
  • Showing resources:
    ohsh> show resources

This command lists default resources and any additional resources created within the session.

Creating Resources Using JDBC SSL to Connect to Oracle Database

You need to create these resources in order to connect to Oracle Database using SSL.

  • Creating an SSL resource. See Using JDBC SSL to download the SSL client wallet and complete the configuration steps.
  • Creating a SQL resource:
    ohsh> create sqlplus resource sql_ssl connectid="<tns entry name for the SSL connection>"
    For example:
    ohsh> create sqlplus resource sql_ssl connectid="inst1_ssl"

    inst1_ssl is the TNS entry for the JDBC SSL connection.

    Note:

    At the prompt, enter database username and password.
  • Creating a JDBC resource:
    ohsh> create jdbc resource jdbc_ssl 
    connectiondir=<SSL wallet directory location>
    connectid="<tns entry name for the SSL connection>"
    For example:
    ohsh> create jdbc resource ora_mydatabase_cs 
    connectiondir=/home/oracle/ssl_client_wallet 
    connectid="inst1_ssl"

    inst1_ssl is the TNS entry for the JDBC SSL connection.

    Note:

    At the prompt, enter database username and password.

Creating Resources Using Secure Java KeyStore

You can store passwords in a secure Java KeyStore. If you use the Java KeyStore, then you are not prompted for the username and password. You can also add this to the scripts you develop to load data. (An example of creating a password alias appears at the end of this section.)

  • Creating a Java KeyStore. See Using Secure External Java KeyStore and Hadoop credential command.
  • Creating a SQL resource Using a Java KeyStore:
    ohsh> create sqlplus resource sql_ssl_cs user=<username>
    passwordalias=<password alias>  
    provider="<provider_path>"
    connectid="<tns entry name for the SSL connection>"
    For example:
    ohsh> create sqlplus resource sql_ssl_cs user=oracle 
    passwordalias=oracle_passwd
    provider="jceks://file/home/oracle/passwd.jceks"
    connectid="inst1"
  • Creating a SQL resource Using Java KeyStore with JDBC SSL connection:
    ohsh> create sqlplus resource sql_ssl_cs user=<username> 
    passwordalias=<password alias>  
    provider="<provider_path>"
    connectiondir=<SSL wallet directory location> 
    connectid="<tns entry name for the SSL connection>
    For example:
    ohsh> create sqlplus resource sql_ssl_cs user=oracle 
    passwordalias=oracle_passwd
    provider="jceks://file/home/oracle/passwd.jceks"
    connectiondir=/home/oracle/ssl_client_wallet
    connectid="inst1_ssl"
  • Creating a JDBC resource Using a Java KeyStore:
    ohsh> create jdbc resource jdbc_ssl_cs user=<username>     
    passwordalias=<password alias>    
    provider="<provider_path>" 
    connectid="<tns entry name for the SSL connection>"
    For example:
    ohsh> create jdbc resource jdbc_ssl_cs user=oracle 
    passwordalias=oracle_passwd
    provider="jceks://file/home/oracle/passwd.jceks"
    connectid="inst1"
  • Creating a JDBC resource Using Java KeyStore with JDBC SSL connection:
    ohsh> create jdbc resource jdbc_ssl_cs user=<username>     
    passwordalias=<password alias>    
    provider="<provider_path>" 
    connectiondir=<SSL wallet directory location> 
    connectid="<tns entry name for the SSL connection>"
    For example:
    ohsh> create jdbc resource jdbc_ssl_cs user=oracle 
    passwordalias=oracle_passwd
    provider="jceks://file/home/oracle/passwd.jceks"
    connectiondir=/home/oracle/ssl_client_wallet 
    connectid="inst1_ssl"

Getting Help

The OHSH shell provides online help for all commands.

To get a list of all OHSH commands:

ohsh> help

To get help on a specific command, enter help, followed by the command:    

ohsh> help show

The table below describes the help categories available.

Help Command     Description
---------------  -----------
help load        Describes load commands for Oracle and Hadoop tables.
help set         Shows help for setting defaults for load operations. It also describes which load methods are affected by a particular setting.
help show        Shows help for inspecting default settings.
help shell       Shows shell-like commands.
help resource    Shows commands for creating and dropping named resources.

Load an Oracle Database Table

You can use the OHSH load command to load data from HDFS or Hive into an Oracle Database table using OLH or OSCH or to load data from an Oracle Database table into Hive using Copy to Hadoop.

The following load options are available for loading from HDFS or Hive into Oracle Database:

  • jdbc (the default for Oracle tables that are not partitioned) - load from HDFS or Hive using the OLH JDBC Output Format.

  • directpath (the default for partitioned Oracle tables) - load from HDFS or Hive using the OLH OCI Direct Path Output Format.

  • exttab - load from HDFS or Hive using OSCH.

  • etl - load CSV content in HDFS using OLH and OSCH together. OLH extracts and transforms the delimited content in HDFS files into Data Pump files in HDFS. OSCH then loads the Data Pump files from HDFS into a target table in Oracle Database (see the example after this list).

  • etl deferred - use OLH to transform the delimited content in HDFS files into Data Pump files in HDFS. The Data Pump files are left there for OSCH to load at a later time. The OHSH script for doing the load is generated in the local example directory and can be run manually to complete the load.
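
For example, an etl load of delimited text files might look like the following; the resource name, target table, and HDFS path are hypothetical:

ohsh> load oracle table ora_mydatabase:MY_TARGET_TABLE from path hadoop0:/user/oracle/csv_data using etl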

The following load options are available for loading from an Oracle Database table into Hive:

  • directcopy (the default) - uses the Copy to Hadoop feature of Oracle Big Data SQL to connect to Oracle Database and copy data from an Oracle table into a Data Pump file stored in HDFS.
  • stage - connects to Oracle Database and generates Data Pump format files containing the table data and metadata, copies the Data Pump file from the database to HDFS, and then uses the Copy to Hadoop SerDes to enable Hive to read the Data Pump file (see the example after this list).
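
For example, staging a copy of an Oracle table into Hive might look like the following; the table names are hypothetical:

ohsh> load hive table hive0:MY_HIVE_TABLE from oracle table ora_mydatabase:MY_ORACLE_TABLE using stage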

See Using Oracle Shell for Hadoop Loaders With Copy to Hadoop in the Oracle Big Data SQL User's Guide for more information about using OHSH with Copy to Hadoop.

  1. Use the load command to load files from HDFS into a target table in the Oracle database.

    The following command loads data from a delimited text file in HDFS <HDFS path> into the target table in Oracle Database using the direct path option.

    ohsh> load oracle table ora_mydatabase:<target table in the Oracle database> from path hadoop0:/user/<HDFS path> using directpath

    Note:

    The default direct path method is the fastest way to load a table, but it requires a partitioned target table. Direct path is always recommended for partitioned tables. Use the jdbc option to load into a non-partitioned target table.

    If the command does not explicitly state the load method, then OHSH automatically uses the appropriate method. If the target Oracle table is partitioned, then by default OHSH uses direct path (that is, Oracle OCI). If the Oracle table is not partitioned, it uses JDBC.

  2. After loading, check the number of rows.

    You can do this conveniently from the OHSH command line:

    ohsh> %sql0 select count(*) from <target table in Oracle Database>

Loading a Hive Table Into an Oracle Database Table

You can use OHSH to load a Hive table into a target table in an Oracle database. The command below shows how to do this using the direct path method.

ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name>

Note that if the target table is partitioned, then OHSH uses direct path automatically. You do not need to enter using directpath explicitly in the command.

If the target table is non-partitioned, then specify the JDBC method instead:

ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name> using jdbc

Note:

The load command assumes that the column names in the Hive table and in the Oracle Database table match identically. If they do not match, then use an OHSH loadermap.

Using OHSH Loadermaps

The simple load examples in this section assume the following:

  • Where we load data from a text file in Hadoop into an Oracle Database table, the declared order of columns in the target table maps correctly to the physical ordering of the delimited text fields in the file.

  • Where we load Hive tables into Oracle Database tables, the Hive and Oracle Database column names are identically matched.

However, in less straightforward cases where the column names (or the order of column names and delimited text fields) do not match, use the OHSH loadermap construct to correct these mismatches.

You can also use a loadermap to specify a subset of target columns to load or, in the case of a load from a text file, to specify the format of a field in the load.

Loadermaps are not covered in this introduction.

Performance Tuning Oracle Loader for Hadoop in OHSH

Aside from network bandwidth, two factors can have significant impact on Oracle Loader for Hadoop performance. You can tune both in OHSH.

  • Degree of parallelism

    The degree of parallelism affects performance when Oracle Loader for Hadoop runs in Hadoop. For the default method (direct path), parallelism is determined by the number of reducer tasks. The higher the number of reducer tasks, the faster the performance. The default value is 4. To set the number of tasks:

    ohsh> set reducetasks 18

    For the JDBC option, parallelism is determined by the number of map tasks and the optimal number is determined automatically. However, remember that if the target table is partitioned, direct path is faster than JDBC.

  • Load balancing

    Performance is best when the load is balanced evenly across reduce tasks. The load is balanced by sampling the data. Sampling is enabled by default for loads using the JDBC and the default copy method.

Debugging in OHSH

Several OHSH settings control the availability of debugging information:

  • outputlevel

    The outputlevel is set to minimal by default. Set it to verbose in order to return a stack trace when a command fails:

    ohsh> set outputlevel verbose
  • logbadrecords

    ohsh> set logbadrecords true

    This is set to true by default.

These log files are informative for debugging (an example of listing them from OHSH appears after this list):

  • Oracle Loader for Hadoop log files.

    /user/<username>/smartloader/jobhistory/oracle/<target table schema>/<target table name>/<OHSH job ID>/_olh
  • Log files generated by the map and reduce tasks.
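
For example, you can list the Oracle Loader for Hadoop log files from within OHSH using the default hadoop0 resource; the path components below are the placeholders from the pattern above:

ohsh> %hadoop0 fs -ls /user/<username>/smartloader/jobhistory/oracle/<target table schema>/<target table name>/<OHSH job ID>/_olh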

Other OHSH Properties That are Useful for Oracle Loader for Hadoop

You can set these properties on the OHSH command line or in a script; illustrative examples for several of them appear after the list. These values persist for the current session only.

  • dateformat

    ohsh> set dateformat "yyyy-MM-dd HH:mm:ss"

    The syntax of the format string is dictated by the Java date format.

  • rejectlimit

    The number of rows that can be rejected before the load of a delimited text file fails.

  • fieldterminator

    The field terminator in loads of delimited text files.

  • hadooptnsadmin

    Location of an Oracle TNS admin directory in the Hadoop cluster.

  • hadoopwalletlocation

    Location of the Oracle Wallet directory in the Hadoop cluster.
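
A few illustrative settings follow; the values are arbitrary examples under the assumption that each property takes a simple value, as dateformat does above:

ohsh> set rejectlimit 10
ohsh> set fieldterminator ","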

Using the exttab (External Table) Method to Load Data

A third option to load data from Hadoop into Oracle Database is exttab.

Note:

The exttab option is available only in on-premises deployments of OHSH. It is not available in Oracle Cloud services.

With the exttab method, data is loaded via external tables. OHSH creates the external table using Oracle SQL Connector for HDFS, and then uses a CREATE TABLE AS SELECT statement to load the data into the target table:

ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name> using exttab