3 Storing Oracle Data in Hadoop

Note: This topic does not apply to Oracle Big Data SQL Cloud Service.

Copy to Hadoop and Oracle Database Tablespaces in HDFS are two Oracle Big Data SQL resources for off-loading Oracle Database tables to the HDFS file system on a Hadoop cluster.

The table below compares these two tools.

Table 3-1 Comparison of Copy to Hadoop and Oracle Tablespaces in HDFS

Copy to Hadoop | Oracle Tablespaces in HDFS
Copies Oracle Database tables to Oracle Data Pump files stored in HDFS. | Oracle Database tables or partitions are stored within the tablespace in HDFS in their original Oracle-internal format.
Access is through a Hive external table and from the database with Oracle Big Data SQL. | Access is directly through the original Oracle Database tables. External tables are not needed.
Data is available (through Hive) to other processes in the Hadoop ecosystem and to Oracle Database (through Oracle Big Data SQL). | Data is directly available to Oracle Database only. Data is not accessible to other processes in Hadoop.

3.1 Using Copy to Hadoop

Note: This topic does not apply to Oracle Big Data SQL Cloud Service.

This section describes how to use Copy to Hadoop to copy Oracle Database tables to Hadoop.

3.1.1 What Is Copy to Hadoop?

Oracle Big Data SQL includes the Oracle Copy to Hadoop utility. This utility makes it simple to identify and copy Oracle data to the Hadoop Distributed File System. It can be accessed through the command-line interface Oracle Shell for Hadoop Loaders.

Data exported to the Hadoop cluster by Copy to Hadoop is stored in Oracle Data Pump format. The Oracle Data Pump files can be queried by Hive or Big Data SQL. The Oracle Data Pump format optimizes queries through Big Data SQL in the following ways:

  • The data is stored as Oracle data types – eliminating data type conversions.

  • The data is queried directly – without requiring the overhead associated with Java SerDes.

After Data Pump format files are in HDFS, you can use Apache Hive to query the data. Hive can process the data locally without accessing Oracle Database. When the Oracle table changes, you can refresh the copy in Hadoop. Copy to Hadoop is primarily useful for Oracle tables that are relatively static, and thus do not require frequent refreshes.
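
For example, once Copy to Hadoop has created a Hive external table over the Data Pump files, an ordinary HiveQL query is all that is needed; the table and column names below are hypothetical:

hive> SELECT cust_id, COUNT(*) AS order_count
      FROM orders_ctoh
      GROUP BY cust_id;

Hive reads the Data Pump files locally through the Data Pump SerDe; no connection to Oracle Database is made at query time.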

Copy to Hadoop is licensed under Oracle Big Data SQL. You must have an Oracle Big Data SQL license in order to use this utility.

3.1.2 Getting Started Using Copy to Hadoop

Installing Copy to Hadoop

To install Copy to Hadoop: Follow the Copy to Hadoop and Oracle Shell for Hadoop Loaders installation procedures in the Oracle Big Data SQL Installation Guide.

As described in the installation guide, ensure that the prerequisite software is installed on both the Hadoop cluster (on Oracle Big Data Appliance or another Hadoop system) and on the Oracle Database server (Oracle Exadata Database Machine or other).

Starting OHSH and Using Copy to Hadoop

  1. Invoke Oracle Shell for Hadoop Loaders (OHSH) to do a direct, one-step copy or a staged, two-step copy of data in Oracle Database to Data Pump format files in HDFS, and create a Hive external table from the files.
    OHSH will choose directcopy by default to do a direct, one-step copy. This is faster than a staged, two-step copy and does not require storage on the database server. However, there are situations where you should do a staged, two-step copy:
    • Copying columns from multiple Oracle Database source tables. (The direct, one-step copy copies data from one table only.)

    • Copying columns of type TIMESTAMPTZ or TIMESTAMPLTZ to Hive.

      Since Hive does not have a data type that supports time zones or time offsets, you must cast these columns to TIMESTAMP when manually exporting them to Data Pump files.

    • Copying data from a view. Views are not supported by the directcopy option.

     The staged two-step copy using the manual steps is demonstrated in "Appendix A: Manual Steps for Using Copy to Hadoop for Staged Copies".

  2. Query this Hive table the same as you would any other Hive table.

Tip:

For Hadoop power users with specialized requirements, the manual option for Direct Copy is recommended. See Manual Steps for Using Copy to Hadoop for Direct Copies in Appendix B.

3.1.2.1 Table Access Requirements for Copy to Hadoop

To copy a table using Copy to Hadoop, an Oracle Database user must meet one of these requirements.

  • The user is the owner of the table, or

  • The user is accessing a table in another schema and has the following privileges:

    • The SELECT privilege on the table.

    • The SELECT_CATALOG_ROLE role, which provides SELECT privileges on data dictionary views. A minimal grant script is sketched after this list.
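
A minimal grant script for the second case, assuming a hypothetical table SCOTT.SALES and a copying user C2H_USER (adjust the object and user names to your environment):

SQL> GRANT SELECT ON scott.sales TO c2h_user;
SQL> GRANT select_catalog_role TO c2h_user;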

3.1.3 Using Oracle Shell for Hadoop Loaders With Copy to Hadoop

3.1.3.1 Introducing Oracle Shell for Hadoop Loaders

What is Oracle Shell for Hadoop Loaders?

Oracle Shell for Hadoop Loaders (OHSH) is a helper shell that provides an easy-to-use command line interface to Oracle Loader for Hadoop, Oracle SQL Connector for HDFS, and Copy to Hadoop. It has basic shell features such as command line recall, history, inheriting environment variables from the parent process, setting new or existing environment variables, and performing environmental substitution in the command line. 

The core functionality of Oracle Shell for Hadoop Loaders includes the following:

  • Defining named external resources with which Oracle Shell for Hadoop Loaders interacts to perform loading tasks.

  • Setting default values for load operations.

  • Running load commands.

  • Delegating simple pre- and post-load tasks to the operating system, HDFS, Hive, and Oracle Database. These tasks include viewing the data to be loaded, and viewing the data in the target table after loading.

See Also:

(This topic applies to on-premises Oracle Big Data SQL only.)
  • To set up OHSH, follow the instructions in the Oracle Big Data SQL Installation Guide.

  • The examples directory in the OHSH kit contains many examples that define resources and load data using Oracle Shell for Hadoop Loaders.  See <OHSH_KIT>/examples/README.txt for a description of the examples and instructions on how to run OHSH load methods.

3.1.4 Copy to Hadoop by Example

3.1.4.1 First Look: Loading an Oracle Table Into Hive and Storing the Data in Hadoop

This set of examples shows how to use Copy to Hadoop to load data from an Oracle table, store the data in Hadoop, and perform related operations with the OHSH shell. It assumes that OHSH and Copy to Hadoop are already installed and configured.

What’s Demonstrated in These Examples

These examples demonstrate the following tasks:

  • Starting an OHSH session and creating the resources you’ll need for Copy to Hadoop.

  • Using Copy to Hadoop to copy the data from the selected Oracle Database table to a new Hive table in Hadoop (using the resources that you created).

  • Using the load operation to add more data to the Hive table created in the first example.

  • Using the create or replace operation to drop the Hive table and replace it with a new one that has a different record set.

  • Querying the data in the Hive table and in the Oracle Database table.

  • Converting the data into other formats.

Tip:

You may want to select or create a small table in Oracle Database and work through these steps as a first exercise.

Starting OHSH, Creating Resources, and Running Copy to Hadoop

  1. Start OHSH. (The startup command below assumes that you’ve added the OHSH path to your PATH variable as recommended.)

    $ ohsh
    ohsh>
    
  2. Create the following resources.

    • SQL*Plus resource.

      ohsh> create sqlplus resource sql0 connectid="<database_connection_url>"
      
    • JDBC resource.

      ohsh> create jdbc resource jdbc0 connectid="<database_connection_url>"
      

    Note:

    For the Hive access shown in this example, only the default hive0 resource is needed. This resource is already configured to connect to the default Hive database. If additional Hive resources were required, you would create them as follows:
    ohsh> create hive resource hive_mydatabase connectionurl="jdbc:hive2:///<Hive_database_name>"
    
  3. Include the Oracle Database table name in the create hive table command below, and then run the command. This command uses the Copy to Hadoop directcopy method. Note that directcopy is the default mode, so you do not actually need to name it explicitly.

    ohsh> create hive table hive0:<new_Hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name> using directcopy
    

     Your Oracle table data is now stored in Hadoop as a Hive table.

Adding More Data to the Hive Table

Use the OHSH load method to add data to an existing Hive table.

Let’s assume that the original Oracle table includes a time field in the format DD-MON-YY and that a number of daily records were added after the Copy to Hadoop operation that created the corresponding Hive table.

Use load to add these new records to the existing Hive table:

ohsh> load hive table hive0:<Hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name> where "(time >= '01-FEB-18')"

Using OHSH create or replace

The OHSH create or replace operation does the following:

  1. Drops the named Hive table (and the associated Data Pump files) if a table by this name already exists.

    Note:

    Unlike create or replace, a create operation fails and returns an error if the Hive table and the related Data Pump files already exist.

  2. Creates a new Hive table using the name provided.

Suppose some records were deleted from the original Oracle Database table and you want to realign the Hive table with the new state of the Oracle Database table. Hive does not support update or delete operations on records, but the create or replace operation in OHSH can achieve the same end result:

ohsh> create or replace hive table hive0:<new_hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name>

Note:

Data copied to Hadoop by Copy to Hadoop can be queried through Hive, but the data itself is actually stored as Oracle Data Pump files. Hive only points to the Data Pump files.
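
You can see these files directly in HDFS. The listing below is a sketch that assumes the Hive table was created in the default database and stored under the default Hive warehouse directory; substitute your own table name and warehouse path:

$ hadoop fs -ls /user/hive/warehouse/<Hive_table_name>

The directory contains the Oracle Data Pump (.dmp) files that back the Hive external table.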

Querying the Hive Table

You can invoke a Hive resource in OHSH in order to run HiveQL commands. Likewise, you can invoke an SQL*Plus resource to run SQL commands. For example, these two queries compare the original Oracle Database table with the derivative Hive table:

ohsh> %sql0 select count(*) from <Oracle_Database_table_name>
ohsh> %hive0 select count(*) from <Hive_table_name>

Storing Data in Other Formats, Such as Parquet or ORC

By default, Copy to Hadoop outputs Data Pump files. In a create operation, you can use the "stored as" syntax to change the destination format to Parquet or ORC:

ohsh> %hive0 create table <Hive_table_name_parquet> stored as parquet as select * from <Hive_table_name>

The Copy to Hadoop operation still creates the Data Pump files first; this command then copies that data into a new Hive table stored in Parquet format. (The original Data Pump files are not deleted.)
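
The same pattern works for ORC storage. For example (the table names are placeholders):

ohsh> %hive0 create table <Hive_table_name_orc> stored as orc as select * from <Hive_table_name>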

See Also:

This section of the documentation is an overview of OHSH and Copy to Hadoop. More detail is available in the OHSH entries in the Oracle blog space.

3.1.4.2 Working With the Examples in the Copy to Hadoop Product Kit

The OHSH product kit provides an examples directory at the path where OHSH is installed. This section walks you through several examples from the kit.

3.1.4.2.1 Using Copy to Hadoop With the Default Copy Method

This section assumes that OHSH and Copy to Hadoop are installed and configured.

The following examples from the Copy to Hadoop product kit show how to use Copy to Hadoop with the default method of loading data. You can find the code in the examples directory where the kit is installed (<OHSH_KIT>/examples).

Example 3-1 createreplace_directcopy.ohsh

This script uses the create or replace operation to create a Hive external table called cp2hadoop_fivdti from the Oracle table FIVDTI. It then loads the Hive table with 10000 rows. It uses the default load method, directcopy, to run a map job on Hadoop and split the Oracle table into input splits. The resulting Hive external table includes all of the splits.

create or replace hive table hive0:cp2hadoop_fivdti \ 
from oracle table olhp:fivdti using directcopy 

In the example above and in the code samples that follow, olhp is a user-defined JDBC resource.

Example 3-2 load_directcopy.ohsh

The load_directcopy.ohsh script shows how to load the Hive table that was created in createreplace_directcopy.ohsh with an additional 30 rows. This script also uses the directcopy method.

load hive table hive0:cp2hadoop_fivdti from oracle table olhp:fivdti \
using directcopy where "(i7 < 30)";

Tip:

You have the option to convert the storage in Hadoop from the default Data Pump format to Parquet or ORC format. For example:
%hive0 create table cp2hadoop_fivdti_parquet stored as parquet as select * from cp2hadoop_fivdti

The original Data Pump files are not deleted.

3.1.4.2.2 Using Copy to Hadoop With the Staged Copy Method

The first example below shows how to use Oracle Shell for Hadoop Loaders (OHSH) with Copy to Hadoop to do a staged, two-step copy from Oracle Database to Hadoop. The stage method is an alternative to the directcopy method.

The second example shows how to load additional rows into the same table. It also uses the stage method.

Both examples assume that OHSH and Copy to Hadoop have been installed and configured, and that the examples have been configured according to the instructions in README.txt in the examples directory of the OHSH installation. The scripts below and many others are available in the examples directory.

Example 3-3 createreplace_stage.ohsh

This script uses create or replace to create a Hive table called cp2hadoop_fivdti from the Oracle table FIVDTI.  It uses the stage command, which automatically does the following: 

  1. Exports the contents of the source table in Oracle Database to Data Pump format files on the local disk.

  2. Moves the Data Pump format files to HDFS.

  3. Creates the Hive external table that maps to the Data Pump format files in HDFS.

create or replace hive table hive0:cp2hadoop_fivdti \
from oracle table olhp:fivdti using stage

In the command above (and also in the next code example), olhp is a user-defined JDBC resource.

Example 3-4 load_stage.ohsh

The load_stage.ohsh script shows how to load the Hive table created by createreplace_stage.ohsh with an additional 30 rows using the stage method.

load hive table hive0:cp2hadoop_fivdti from oracle table olhp:fivdti \
using stage where "(i7 < 30)";

Manual Option

The two-step method demonstrated in the createreplace_stage.ohsh and load_stage.ohsh example scripts automates some of the tasks required to do staged copies. However, there may be reasons to perform the steps manually, such as:

  • You want to load columns from multiple Oracle Database source tables.

  • You want to load columns of type TIMESTAMPTZ or TIMESTAMPLTZ.

See Appendix A: Manual Steps for Using Copy to Hadoop for Staged Copies.

3.1.5 Querying the Data in Hive

The following OHSH command shows the number of rows in the Hive table after copying from the Oracle table.

%hive0 select count(*) from cp2hadoop_fivdti;

3.1.6 About Column Mappings and Data Type Conversions

3.1.6.1 About Column Mappings

The Hive table columns automatically have the same names as the Oracle columns, which are provided by the metadata stored in the Data Pump files. Any user-specified column definitions in the Hive table are ignored.
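
You can confirm the inherited column names and data types from the Hive side. For example, using the default hive0 resource (the table name is a placeholder):

ohsh> %hive0 describe <Hive_table_name>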

3.1.6.2 About Data Type Conversions

Copy to Hadoop automatically converts the data in an Oracle table to an appropriate Hive data type. Table 3-2 shows the default mappings between Oracle and Hive data types.

Table 3-2 Oracle to Hive Data Type Conversions

Oracle Data Type | Hive Data Type

NUMBER | INT when the scale is 0 and the precision is less than 10;
         BIGINT when the scale is 0 and the precision is less than 19;
         DECIMAL when the scale is greater than 0 or the precision is greater than 19
CLOB, NCLOB | STRING
INTERVALYM, INTERVALDS | STRING
BINARY_DOUBLE | DOUBLE
BINARY_FLOAT | FLOAT
BLOB | BINARY
ROWID, UROWID | BINARY
RAW | BINARY
CHAR, NCHAR | CHAR
VARCHAR2, NVARCHAR2 | VARCHAR
DATE | TIMESTAMP
TIMESTAMP | TIMESTAMP
TIMESTAMPTZ, TIMESTAMPLTZ (see Footnote 1) | Unsupported

Footnote 1: To copy TIMESTAMPTZ and TIMESTAMPLTZ data to Hive, follow the instructions in Appendix A: Manual Steps for Using Copy to Hadoop for Staged Copies. Cast the columns to TIMESTAMP when exporting them to the Data Pump files.
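
The following is a minimal sketch of that cast, assuming a manual staged export through an ORACLE_DATAPUMP external table; the directory object, file name, table, and column names are hypothetical:

SQL> CREATE TABLE sales_tz_export
       ORGANIZATION EXTERNAL (
         TYPE ORACLE_DATAPUMP
         DEFAULT DIRECTORY export_dir
         LOCATION ('sales_tz.dmp')
       )
     AS SELECT sale_id,
               CAST(sale_time AS TIMESTAMP) AS sale_time
        FROM sales_tz;

The resulting sales_tz.dmp file can then be copied to HDFS and mapped to a Hive external table, as described in the appendix.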

3.1.7 Working With Spark

The Oracle Data Pump files exported by Copy to Hadoop can be used in Spark.  

The Spark installation must be configured to work with Hive. Launch a Spark shell, specifying the Copy to Hadoop jars:

prompt> spark-shell --jars orahivedp.jar,ojdbc7.jar,oraloader.jar,orai18n.jar,ora-hadoop-common.jar

Verify the type of sqlContext in spark-shell:

scala> sqlContext

Your output will look like the following:

res0: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@66ad7167

If the default sqlContext is not a HiveContext, create one:

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

You can now create a DataFrame df that points to a Hive external table over the Oracle Data Pump files:

scala> val df = sqlContext.table("<hive external table>")
df: org.apache.spark.sql.DataFrame = [ <column names> ]

Now you can access the data through the DataFrame:

scala> df.count
scala> df.head

If a Hive external table has not been created and you only have the Oracle Data Pump files created by Copy to Hadoop, you can create the Hive external table from within Spark:

scala> sqlContext.sql("CREATE EXTERNAL TABLE <hive external table> ROW FORMAT SERDE 
'oracle.hadoop.hive.datapump.DPSerDe' STORED AS INPUTFORMAT
'oracle.hadoop.hive.datapump.DPInputFormat' OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION
'/user/oracle/oracle_warehouse/<hive database name>'")

3.1.8 Using Oracle SQL Developer with Copy to Hadoop

Oracle SQL Developer is a free, GUI-based development environment that provides easy-to-use tools for working with Oracle Big Data Connectors, including Copy to Hadoop.

Using Oracle SQL Developer, you can copy data and create a new Hive table, or append data to an existing Hive external table that was created by Copy to Hadoop. In the GUI, you can initiate Copy to Hadoop in Oracle SQL Developer by right-clicking the Tables icon under any Hive schema. You can then append to an existing Hive external table by right-clicking the icon for that Hive table.

See Installing Oracle SQL Developer in this manual for instructions on where to obtain Oracle SQL Developer and how to do the basic installation.

See Also:

Oracle SQL Developer User’s Guide for complete details on Oracle SQL Developer.

3.2 Storing Oracle Tablespaces in HDFS

Note: This topic does not apply to Oracle Big Data SQL Cloud Service.

You can store Oracle read-only tablespaces on HDFS and use Big Data SQL Smart Scan to off-load query processing of data stored in that tablespace to the Hadoop cluster. Big Data SQL Smart Scan performs data-local processing, filtering query results on the Hadoop cluster before the data is returned to Oracle Database. In most circumstances, this can be a significant performance optimization. In addition to Smart Scan, querying tablespaces in HDFS also leverages native Oracle Database access structures and performance features. This includes features such as indexes, Hybrid Columnar Compression, Partition Pruning, and Oracle Database In-Memory.

Tables, partitions, and data in tablespaces in HDFS retain their original Oracle Database internal format. This is not a data dump. Unlike other means of accessing data in Hadoop (or other NoSQL systems), you do not need to create an Oracle external table. After copying the corresponding Oracle tablespaces to HDFS, you refer to the original Oracle table to access the data.

Permanent online, read only, and offline tablespaces (including ASM tablespaces) are eligible for the move to HDFS.

Note:

Since tablespaces allocated to HDFS may not be altered, offline tablespaces must remain offline. For offline tablespaces, then, what this feature provides is a hard backup into HDFS.

If you want to use Oracle SQL Developer to perform the operations in this section, confirm that you can access the Oracle Database server from your on-premises location. This typically requires a VPN connection.

3.2.1 Advantages and Limitations of Tablespaces in HDFS

The following are some reasons to store Oracle Database tablespaces in HDFS.

  • Because the data remains in Oracle Database internal format, I/O requires no resource-intensive datatype conversions.

  • All Oracle Database performance optimizations such as indexing, Hybrid Columnar Compression, Partition Pruning, and Oracle Database In-Memory can be applied.

  • Oracle user-based security is maintained. Other Oracle Database security features such as Oracle Data Redaction and ASO transparent encryption remain in force if enabled. In HDFS, tablespaces can be stored in zones under HDFS Transparent Encryption.

  • Query processing can be off-loaded. Oracle Big Data SQL Smart Scan is applied to Oracle Database tablespaces in HDFS. Typically, Smart Scan can provide a significant performance boost for queries. With Smart Scan, much of the query processing workload is off-loaded to the Oracle Big Data SQL server cells on the Hadoop cluster where the tablespaces reside. Smart Scan then performs predicate filtering in-place on the Hadoop nodes to eliminate irrelevant data so that only data that meets the query conditions is returned to the database tier for processing. Data movement and network traffic are reduced to the degree that smart scan predicate filtering can distill the dataset before returning it to the database.

  • For each table in the tablespace, there is only a single object to manage – the Oracle-internal table itself. To be accessible to Oracle Database, data stored in other file formats typically used in HDFS requires an overlay of an external table and a view.

  • As is always the case with Oracle internal partitioning, partitioned tables and indexes can have partitions in different tablespaces, some of which may be in Exadata, ZFSSA, and other storage devices. This feature adds HDFS as another storage option.

There are some constraints on using Oracle tablespaces in HDFS. As is the case with all data stored in HDFS, Oracle Database tables, partitions, and data stored in HDFS are immutable. Updates are done by deleting and replacing the data. This form of storage is best suited to off-loading tables and partitions for archival purposes. Also, with the exception of OD4H, data in Oracle tablespaces in HDFS is not accessible to other tools in the Hadoop environment, such as Spark, Oracle Big Data Discovery, and Oracle Big Data Spatial and Graph.

3.2.2 About Tablespaces in HDFS and Data Encryption

Oracle Database tablespaces in HDFS can work with ASO (Oracle Advanced Security) transparent table encryption as well as HDFS Transparent Encryption.

Tablespaces With Oracle Database ASO Encryption

In Oracle Database, ASO transparent encryption may be enabled for a tablespace or objects within the tablespace. This encryption is retained if the tablespace is subsequently moved to HDFS. For queries against this data, the CELL_OFFLOAD_DECRYPTION setting determines whether Oracle Big Data SQL or Oracle Database decrypts the data.

  • If CELL_OFFLOAD_DECRYPTION = TRUE, then the encryption keys are sent to the Oracle Big Data server cells in Hadoop and data is decrypted at the cells.

  • If CELL_OFFLOAD_DECRYPTION = FALSE, encryption keys are not sent to the cells and therefore the cells cannot perform TDE decryption. The data is returned to Oracle Database for decryption.

The default value is TRUE.
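
If you need to override the default, you can change the parameter with ALTER SYSTEM. This is a sketch; verify the exact scope and privileges required for your release:

SQL> ALTER SYSTEM SET cell_offload_decryption = FALSE;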

Note:

In cases where CELL_OFFLOAD_DECRYPTION is set to FALSE, Smart Scan cannot read the encrypted data and is unable to provide the performance boost that results from the Hadoop-side filtering of the query result set. TDE Column Encryption prevents Smart Scan processing of the encrypted columns only. TDE Tablespace Encryption prevents Smart Scan processing of the entire tablespace.

Tablespaces in HDFS Transparent Encryption Zones

You can move Oracle Database tablespaces into zones under HDFS Transparent Encryption with no impact on query access or on the ability of Smart Scan to filter data.
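
A minimal sketch of preparing such a zone, assuming the Hadoop Key Management Server (KMS) is configured; the key name and HDFS path below are placeholders:

$ hadoop key create tbskey
$ hdfs dfs -mkdir /user/oracle/secure_tablespaces
$ hdfs crypto -createZone -keyName tbskey -path /user/oracle/secure_tablespaces

Datafiles copied into this directory are transparently encrypted by HDFS, and Smart Scan continues to filter the data as usual.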

3.2.3 Moving Tablespaces to HDFS

Oracle Big Data SQL provides two options for moving tablespaces from Oracle Database to the HDFS file system in Hadoop.

  • Using bds-copy-tbs-to-hdfs

    The script bds-copy-tbs-to-hdfs.sh lets you select a preexisting tablespace in Oracle Database. The script automates the move of the selected tablespace to HDFS and performs necessary SQL ALTER operations and datafile permission changes for you. The DataNode where the tablespace is relocated is predetermined by the script. The script uses FUSE-DFS to move the datafiles from Oracle Database to the HDFS file system in the Hadoop cluster.

    You can find bds-copy-tbs-to-hdfs.sh in the bds-database-install directory where the Oracle Big Data SQL database bundle was extracted.

  • Manually Moving Tablespaces to HDFS

    As an alternative to bds-copy-tbs-to-hdfs.sh, you can manually perform the steps to move the tablespaces to HDFS. You can either move an existing tablespace, or, create a new tablespace and selectively add tables and partitions that you want to off-load. In this case, you can set up either FUSE-DFS or an HDFS NFS gateway service to move the datafiles to HDFS.

The scripted method is more convenient. The manual method is somewhat more flexible. Both are supported.

3.2.3.1 Using bds-copy-tbs-to-hdfs

On the Oracle Database server, you can use the script bds-copy-tbs-to-hdfs.sh to select and move Oracle tablespaces to HDFS. This script is in the bds-database-install directory that you extracted from the database installation bundle when you installed Oracle Big Data SQL.

Syntax


bds-copy-tbs-to-hdfs.sh syntax is as follows:

bds-copy-tbs-to-hdfs.sh
bds-copy-tbs-to-hdfs.sh --install
bds-copy-tbs-to-hdfs.sh --tablespace=<tablespace name> [--pdb=<pluggable database name>]
bds-copy-tbs-to-hdfs.sh --list=<tablespace name> [--pdb=<pluggable database name>]
bds-copy-tbs-to-hdfs.sh --show=<tablespace name> [--pdb=<pluggable database name>]

Table 3-3 bds-copy-tbs-to-hdfs Parameter Options

Parameter List | Description
(no parameters) | Returns the FUSE-DFS status.
--install | Installs the FUSE-DFS service. No action is taken if the service is already installed.
--tablespace=<tablespace name> [--pdb=<pluggable database name>] | Moves the named tablespace in the named PDB to storage in HDFS on the Hadoop cluster. If there are no PDBs, then the --pdb argument is discarded.
--list=<tablespace name> [--pdb=<pluggable database name>] | Lists tablespaces whose name equals or includes the name provided. The --pdb parameter is an optional scope. --list=* returns all tablespaces. --pdb=* returns matches for the tablespace name within all PDBs.
--show=<tablespace name> [--pdb=<pluggable database name>] | Shows tablespaces whose name equals or includes the name provided and that have already been moved to HDFS. The --pdb parameter is an optional scope. --show=* returns all tablespaces. --pdb=* returns matches for the tablespace name within all PDBs.

Usage

Use bds-copy-tbs-to-hdfs.sh to move a tablespace to HDFS as follows.

  1. Log on as the oracle Linux user and cd to the bds-database-install directory where the database bundle was extracted. Find bds-copy-tbs-to-hdfs.sh in this directory.

  2. Check that FUSE-DFS is installed.

    $ ./bds-copy-tbs-to-hdfs.sh
    
  3. Install the FUSE-DFS service (if it was not found in the previous check). This command also starts the FUSE-DFS service.

    $ ./bds-copy-tbs-to-hdfs.sh --install
    

    If this script does not find the mount point, it launches a secondary script. Run this script as root when prompted. It will set up the HDFS mount. You can run the secondary script in a separate session and then return to this session if you prefer.

     For RAC Databases: Install FUSE-DFS on All Nodes:

    On a RAC database, the script prompts you to install FUSE-DFS on the other nodes of the database.

  4. List the eligible tablespaces in a selected PDB or all PDBs. You can skip this step if you already know the tablespace name and location.

    $ ./bds-copy-tbs-to-hdfs.sh --list=mytablespace --pdb=pdb1
    
  5. Select a tablespace from the list and then, as oracle, run bds-copy-tbs-to-hdfs.sh again, but this time pass in the --tablespace parameter (and the --pdb parameter if specified). The script moves the tablespace to the HDFS file system.

    $ ./bds-copy-tbs-to-hdfs.sh --tablespace=mytablespace --pdb=pdb1
    

    This command automatically makes the tablespace eligible for Smart Scan in HDFS. It does this in SQL by adding the "hdfs:" prefix to the datafile name in the tablespace definition. The rename changes the pointer in the database control file. It does not change the physical file name.

    For RAC Databases: Replicate Symlinks to Mountpoints on All Nodes:

    On the database node where you are running bds-copy-tbs-to-hdfs.sh , it creates a symlink in $ORACLE_HOME/dbs/hdfs:<cluster name> for each datafile, pointing to the corresponding FUSE-DFS mountpoint. But the script only does this on the local node. You must manually set up the same symlinks on each node of the database.

    For example, if the database consists of three nodes and you run bds-copy-tbs-to-hdfs.sh on Node1, then for each datafile in the copy operation, the script creates a symlink to the mountpoint on Node1. In this example there is only one datafile:

    $ ls -l $ORACLE_HOME/dbs/hdfs:<cluster>/<path to datafile>/<datafile name> 
    $ORACLE_HOME/dbs/hdfs:<cluster>/<path to datafile>/<datafile name> -> /mnt/fuse-<cluster>-hdfs/user/oracle/<path to datafile>/<datafile name>
    

    On Node2 and Node3, you can see that the mountpoint exists (if FUSE-DFS was installed):

    $ ls -l /mnt/fuse-<cluster>-hdfs/user/oracle/<path to datafile>/<datafile name>
    $ /mnt/fuse-<cluster>-hdfs/user/oracle/<path to datafile>/<datafile name>
    

    However, bds-copy-tbs-to-hdfs.sh does not create the symlinks to the mountpoint on these other nodes and you must add them manually:

    $ ln -s /mnt/fuse-<cluster>-hdfs/user/oracle/<path to datafile>/<datafile name> $ORACLE_HOME/dbs/hdfs:<cluster>/<path to datafile>/<datafile name>
    

The tablespace should be back online and ready for access when you have completed this procedure.
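
As a quick check, you can confirm in SQL that the datafile path for the moved tablespace now carries the "hdfs:" prefix. The tablespace name below is a placeholder:

SQL> SELECT file_name FROM dba_data_files WHERE tablespace_name = 'MYTABLESPACE';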

3.2.3.2 Manually Moving Tablespaces to HDFS

As an alternative to bds-copy-tbs-to-hdfs.sh, you can use the following manual steps to move Oracle tablespaces to HDFS.

Note: This topic does not apply to Oracle Big Data SQL Cloud Service.

Note:

In the case of an ASM tablespace, you must first use RMAN or ASMCMD to copy the tablespace to the filesystem.

Oracle Big Data SQL includes FUSE-DFS and these instructions use it to connect to the HDFS file system. You could use an HDFS NFS gateway service instead. The documentation for your Hadoop distribution should provide the instructions for that method.

Perform all of these steps on the Oracle Database server. Run all Linux shell commands as root. For SQL commands, log on to the Oracle Database as the oracle user.

  1. If FUSE-DFS is not installed or is not started, run bds-copy-tbs-to-hdfs.sh --install . This script will install FUSE-DFS (if it’s not already installed) and then start it.

    The script will automatically create the mount point /mnt/fuse-<clustername>-hdfs.

    Note:

    The script bds-copy-tbs-to-hdfs.sh is compatible with FUSE-DFS 2.8 only.
  2. In SQL, use CREATE TABLESPACE to create the tablespace. Store it in a local .dbf file. After this file is populated, you will move it to the Hadoop cluster. A single, bigfile tablespace is recommended.

    For example:
    SQL> CREATE TABLESPACE movie_cold_hdfs DATAFILE '/u01/app/oracle/oradata/cdb/orcl/movie_cold_hdfs1.dbf' SIZE 100M reuse AUTOEXTEND ON nologging;
    
  3. Use ALTER TABLE with the MOVE clause to move objects in the tablespace.

    For example:
    SQL> ALTER TABLE movie_fact MOVE PARTITION 2010_JAN TABLESPACE movie_cold_hdfs ONLINE UPDATE INDEXES;
    
    You should check the current status of the objects to confirm the change. In this case, check which tablespace the partition belongs to.
    SQL> SELECT table_name, partition_name, tablespace_name FROM user_tab_partitions WHERE table_name='MOVIE_FACT';
    
  4. Make the tablespace read only and take it offline.

    SQL> ALTER TABLESPACE movie_cold_hdfs READ ONLY;
    SQL> ALTER TABLESPACE movie_cold_hdfs OFFLINE;
    
  5. Copy the datafile to HDFS and then change the file permissions to read only.

    hadoop fs -put /u01/app/oracle/oradata/cdb/orcl/movie_cold_hdfs1.dbf /user/oracle/tablespaces/
    hadoop fs -chmod 440 /user/oracle/tablespaces/movie_cold_hdfs1.dbf
    

    As a general security practice for Oracle Big Data SQL, apply appropriate HDFS file permissions to prevent unauthorized read/write access.

    You may need to source $ORACLE_HOME/bigdatasql/hadoop_<clustername>.env before running hadoop fs commands.

    As an alternative, you could use the Linux cp command to copy the files to FUSE.

  6. Rename the datafiles, using ALTER TABLESPACE with the RENAME DATAFILE clause.

    Important:

    Note the "hdfs:" prefix to the file path in the SQL example below. This is the keyword that tells Smart Scan that it should scan the file. Smart Scan also requires that the file is read only. The cluster name is optional.

    Also, before running the SQL statement below, the directory $ORACLE_HOME/dbs/hdfs:<clustername>/user/oracle/tablespaces should include the soft link movie_cold_hdfs1.dbf, pointing to /mnt/fuse-<clustername>-hdfs/user/oracle/tablespaces/movie_cold_hdfs1.dbf.

    SQL> ALTER TABLESPACE movie_cold_hdfs RENAME DATAFILE '/u01/app/oracle/oradata/cdb/orcl/movie_cold_hdfs1.dbf' TO 'hdfs:<clustername>/user/oracle/tablespaces/movie_cold_hdfs1.dbf';
    

    When you rename the datafile, only the pointer in the database control file changes. This procedure does not physically rename the datafile.

    The tablespace must exist on a single cluster. If there are multiple datafiles, these must point to the same cluster.

  7. Bring the tablespace back online and test it.
    SQL> ALTER TABLESPACE movie_cold_hdfs ONLINE;
    SQL> SELECT avg(rating) FROM movie_fact;
    

Below is the complete code example. In this case we move three partitions from local Oracle Database storage to the tablespace in HDFS.

mount hdfs
select * from dba_tablespaces;

CREATE TABLESPACE movie_cold_hdfs DATAFILE '/u01/app/oracle/oradata/cdb/orcl/movie_cold_hdfs1.dbf' SIZE 100M reuse AUTOEXTEND ON nologging;

ALTER TABLE movie_fact 
MOVE PARTITION 2010_JAN TABLESPACE movie_cold_hdfs ONLINE UPDATE INDEXES;
ALTER TABLE movie_fact 
MOVE PARTITION 2010_FEB TABLESPACE movie_cold_hdfs ONLINE UPDATE INDEXES;
ALTER TABLE movie_fact 
MOVE PARTITION 2010_MAR TABLESPACE movie_cold_hdfs ONLINE UPDATE INDEXES;

-- Check for the changes 
SELECT table_name, partition_name, tablespace_name FROM user_tab_partitions WHERE table_name='MOVIE_FACT';

ALTER TABLESPACE movie_cold_hdfs READ ONLY;
ALTER TABLESPACE movie_cold_hdfs OFFLINE;

hadoop fs -put /u01/app/oracle/oradata/cdb/orcl/movie_cold_hdfs1.dbf /user/oracle/tablespaces/
hadoop fs -chmod 444 /user/oracle/tablespaces/movie_cold_hdfs1.dbf

ALTER TABLESPACE movie_cold_hdfs RENAME DATAFILE '/u01/app/oracle/oradata/cdb/orcl/movie_cold_hdfs1.dbf' TO 'hdfs:hadoop_cl_1/user/oracle/tablespaces/movie_cold_hdfs1.dbf';
ALTER TABLESPACE movie_cold_hdfs ONLINE;

-- Test
select avg(rating) from movie_fact;

3.2.4 Smart Scan for Tablespaces in HDFS

Smart Scan is an Oracle performance optimization that moves processing to the location where the data resides. In Big Data SQL, Smart Scan searches for datafiles whose path includes the "hdfs:" prefix. This prefix is the key that indicates the datafile is eligible for scanning.

After you have moved your tablespace data to HDFS and have prefixed the datafile path with the "hdfs:" tag, queries that access the data in these files leverage Big Data SQL Smart Scan by default. All of the Big Data SQL Smart Scan performance optimizations apply. This greatly reduces the amount of data that moves from the storage tier to the database tier. These performance optimizations include:

  • The massively parallel processing power of the Hadoop cluster is employed to filter data at its source.

  • Storage Indexes can be leveraged to reduce the amount of data that is scanned.

  • Data mining scoring can be off-loaded.

  • Encrypted data scans can be off-loaded.

Disabling or Enabling Smart Scan

The initialization parameter _CELL_OFFLOAD_HYBRID_PROCESSING determines whether Smart Scan for HDFS is enabled or disabled. It is enabled by default.

To disable Smart Scan for tablespaces in HDFS do the following.

  1. Set the parameter to FALSE in the initialization parameter file:

     _CELL_OFFLOAD_HYBRID_PROCESSING=FALSE 
    

    The underscore prefix is required in this parameter name.

  2. Restart the Oracle Database instance.

You can also make this change dynamically using the ALTER SYSTEM directive in SQL. This does not require a restart.

SQL> alter system set "_cell_offload_hybrid_processing"=false;

One reason to turn off Smart Scan is if you need to move the Oracle tablespace datafiles out of HDFS and back to their original locations.

You can re-enable Smart Scan by resetting _CELL_OFFLOAD_HYBRID_PROCESSING to TRUE.
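
For example (hidden parameters are enclosed in double quotes in ALTER SYSTEM; verify this syntax for your release):

SQL> alter system set "_cell_offload_hybrid_processing"=true;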

Note:

When _CELL_OFFLOAD_HYBRID_PROCESSING is set to FALSE, Smart Scan is disabled for Oracle tablespaces residing in HDFS.