4 Working With Query Server

Learn how to manage Query Server with Oracle Big Data SQL.

4.1 About Oracle Big Data SQL Query Server

Oracle Big Data SQL Query Server is an Oracle Database instance that you can optionally install as a component of Oracle Big Data SQL on an edge node in your Hadoop cluster. You use Query Server primarily to query data stored in the cluster (in HDFS and Hive formats) through Oracle external tables. This enables you to take advantage of the full SQL capabilities of Oracle Database.

You can define external tables using the ORACLE_HDFS, ORACLE_HIVE, or ORACLE_BIGDATA access drivers, or have Query Server automatically define external tables based on the metadata in the Hive metastore. In the latter case, Hive databases map to Oracle Database schemas, and the corresponding Hive tables are defined as Oracle external tables in those schemas. All data authorization is based on authorization rules in Hadoop, such as Apache Sentry or HDFS Access Control Lists (ACLs).
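
The following is a minimal sketch of manually defining an external table over a Hive table with the ORACLE_HIVE access driver. The Hive database, table, and column names are hypothetical, and the sketch assumes that a directory object named DEFAULT_DIR exists:

-- Minimal sketch (hypothetical names): expose the Hive table sales_db.web_sales
-- as an Oracle external table by using the ORACLE_HIVE access driver.
CREATE TABLE web_sales_ext (
  order_id    NUMBER,
  customer_id NUMBER,
  amount      NUMBER(10,2)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename: sales_db.web_sales
  )
);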

Once installed, Query Server provides an Oracle Database deployment that is automatically configured to query data in your Hadoop cluster using SQL. Restarting the Query Server restores the database to a “clean” state, eliminating management overhead. A restart preserves external tables (ORACLE_HIVE, ORACLE_HDFS, and ORACLE_BIGDATA types), associated statistics, user-defined views, and credentials. A restart deletes regular tables containing user data.

If your solution requires High Availability (HA), advanced Oracle security policies, or combining data in Oracle Database with data in Hadoop, then you should install a full Oracle Database with Big Data SQL. Oracle supports using both Query Server and a Big Data SQL-enabled Oracle Database for a single Hadoop deployment.

To install Query Server, you must specify an existing edge node in your Hadoop cluster in the bds-config.json configuration file. You use the same configuration file to specify a list of Hive databases. Query Server automatically creates Oracle external tables corresponding to the tables in the Hive metastore database(s) so that they are ready for querying after a successful installation. You can keep the set of external tables in Query Server up to date with the corresponding Hive metastore tables by running either the Restart this Big Data SQL Query Server or the Synchronize Hive Databases command in the Cloudera Manager or Apache Ambari cluster management software, or by running the dbms_bdsqs.sync_hive_databases PL/SQL API package procedure.

See Also:

See Introduction in the Oracle Big Data SQL Installation Guide, which describes how to install and configure the software on the two sides of an Oracle Big Data SQL configuration.

See Storing Oracle Tablespaces in HDFS for instructions on how to set up data files for smart scanning.

4.2 Important Terms and Concepts

Introduction to edge nodes, edge database, cell nodes, and Hadoop cluster integration.

These terms are key to understanding Query Server.

About Edge Nodes

An edge node in a Hadoop cluster is the interface between the Hadoop cluster and the outside network. Typically, edge nodes are used to run client applications and Hadoop cluster administration tools such as Cloudera Manager and Apache Ambari. Edge nodes can act as a data gateway, by providing HDFS access through NFS or HttpFS, or by running REST servers.

About Cell Nodes

Big Data SQL (BDS) cells run on the Hadoop DataNodes and allow parts of query processing to be pushed down to the DataNodes where the data resides. This distributes the load and reduces the volume of data that must be sent to the database for processing, which can result in significant performance improvements on big data workloads.

Hadoop Cluster Integration

Oracle Big Data SQL includes the following three service roles that you can manage in either Cloudera Manager or Apache Ambari:

  • Big Data SQL Query Server: Enables you to run SQL queries against the Hadoop cluster. Applications connect to this server using JDBC or SQL*Net.
  • Big Data SQL Agent: Manages the Big Data SQL installation and is also used by the Copy to Hadoop feature.
  • Big Data SQL Server: Also known as Big Data SQL Cells, this role allows parts of query processing to be pushed down to the Hadoop cluster DataNodes where the data resides.

4.3 Query Server Features

Big Data SQL Query Server provides automatic installation and configuration, integration with Hadoop cluster managers, and automatic integration of cluster metadata:

  • Automatic installation and configuration: The Oracle Big Data SQL installer automatically installs and configures Query Server if you specify an existing target edge node in the Hadoop cluster in the bds-config.json configuration file. To specify the edge node where Query Server is installed, add the edgedb parameter with the node and enabled attributes to the bds-config.json configuration file, as shown in the following example, where dbnode.domain.com is the name of your edge node:
    "edgedb": {
            "node" : "dbnode.domain.com",  
            "enabled" : "true",  
            "sync_hive_db_list" : "my_hive_db_1,my_hive_db2"  
        }

    Note:

    If the bds-config.json configuration file does not include the edgedb subsection, then Query Server is not installed.

    See Also:

    The bds-config.json Configuration Example in the installation guide shows a fully-populated bds-config.json file. The example includes all available configuration parameters.
  • Integration with Hadoop cluster managers: You can monitor and manage Query Server as a service using Cloudera Manager or Apache Ambari Hadoop cluster management tools.
  • Synchronization with Hive: When you start the Oracle Big Data SQL service, Query Server automatically refreshes its metadata from the Hive metastore. After the initial refresh, users can synchronize the Query Server with the latest metadata in the Hive metastore.

4.4 Specifying the Hive Databases to Synchronize With Query Server

Before you can synchronize Query Server with the desired Hive databases in the metastore, you have to specify the list of Hive databases.

Use either of these methods:
  • During installation, specify the sync_hive_db_list parameter in the bds-config.json configuration file.
  • After installation, you can update the sync_hive_db_list configuration parameter in Cloudera Manager or Apache Ambari.

After installation, Query Server automatically creates schemas and external tables based on the list of Hive metastore databases that you specified. Each subsequent Query Server restart performs a delta synchronization.
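
As a hypothetical example, after synchronization the tables in a Hive database named htdb0 would typically be available as external tables in a corresponding HTDB0 schema, so you can query them with standard SQL (the table and column names below are invented for illustration):

-- Hypothetical query: count rows by status code in a synchronized Hive table.
-- The schema (HTDB0), table (web_logs), and columns are illustrative only.
SELECT status_code, COUNT(*) AS hits
FROM   htdb0.web_logs
WHERE  log_date >= DATE '2019-01-01'
GROUP  BY status_code;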

4.4.1 Specifying the Hive Databases in the bds-config.json Configuration File

You can provide the initial list of Hive databases to synchronize with Query Server as part of the installation process using the bds-config.json configuration file.

In the configuration file, include the sync_hive_db_list configuration parameter followed by a list of the Hive databases. The following example specifies two Hive databases for the sync_hive_db_list configuration parameter: htdb0 and htdb1. Only these two databases will be synchronized with Query Server, even if the Hive metastore contains other databases.

"edgedb": { 
     "node": "<edgenode_host_name>",
     "enabled": "true",
     "sync_hive_db_list": "htdb0,htdb1"
     . . .
    	}

To synchronize all Hive databases in the metastore with Query Server, use the "*" wildcard character as follows:

"edgedb": { 
     "node": "EdgeNode_Host_Name",
     "enabled": "true"
     "sync_hive_db_list": "*"
     . . .
    	}

If the bds-config.json configuration file does not contain the sync_hive_db_list configuration parameter, then no synchronization will take place between the Hive databases and Query Server. In that case, you must specify the Hive databases using the sync_hive_db_list configuration parameter in Cloudera Manager or Apache Ambari.

Note:

Query Server is not intended to store internal data in Oracle tables. Whenever Query Server is restarted, it is reset to its initial, clean state. This eliminates typical database maintenance tasks such as storage management, database configuration, and so on. The goal of Query Server is to provide a SQL front end for data in Hadoop, object stores, Kafka, and NoSQL databases, not to act as a general-purpose RDBMS.

4.4.2 Updating the Hive Databases With the sync_hive_db_list Configuration Parameter

You can update the list of the Hive databases to synchronize with Query Server by using Cloudera Manager.

To update the list by using the sync_hive_db_list configuration parameter in Cloudera Manager, follow these steps:
  1. Log in to Cloudera Manager using your credentials.
  2. In Cloudera Manager, use the Search field to find the Synchronized Hive Databases configuration parameter. Enter Synchronized Hive Databases (or part of the name until it is displayed in the list) in the Search field, and then press Enter.
  3. Click the Big Data SQL: Synchronized Hive Databases parameter.
  4. In the Synchronized Hive Databases text box, enter the names of the Hive databases separated by commas, such as htdb0,htdb1, and then click Save Changes. Only these two Hive databases will be synchronized with Query Server.
    To synchronize all Hive databases in the metastore with Query Server, enter the "*" wildcard character in the Synchronized Hive Databases text box, and then click Save Changes.

4.5 Synchronizing Query Server With Hive

You can synchronize the Query Server with the Hive databases that you specified by using Cloudera Manager, Apache Ambari, or the dbms_bdsqs.sync_hive_databases PL/SQL API.

You can synchronize Query Server with the Hive databases in the metastore using one of the following methods:
  • Execute the Restart this Big Data SQL Query Server command in Cloudera Manager or Apache Ambari.
  • Execute the Synchronize Hive Databases command in Cloudera Manager or Apache Ambari.
  • Invoke the dbms_bdsqs.sync_hive_databases PL/SQL API locally on the edge node.
You must specify the Hive databases to use in the synchronization either by using the bds-config.json configuration file or by using the sync_hive_db_list configuration parameter in Cloudera Manager.

Note that the dbms_bdsqs.sync_hive_databases PL/SQL API will only refresh the Hive table definitions for the Hive databases that have already been synchronized through the other two methods.

4.5.1 Restarting Query Server Manually by Using Cloudera Manager

You can synchronize Query Server with the Hive databases that you specified by restarting Query Server in Cloudera Manager or Apache Ambari.

You can use Cloudera Manager or Apache Ambari to manage Query Server, for example to start, stop, or restart it. When you start or restart Query Server, it synchronizes its metadata with the Hive databases that you specified. Any changes in those Hive databases in the metastore, such as dropped or added tables, are reflected in Query Server. You must specify the Hive databases to use in the synchronization either in the bds-config.json configuration file or with the sync_hive_db_list configuration parameter in Cloudera Manager. For example, you can restart Query Server in Cloudera Manager as follows:
  1. Log in to Cloudera Manager using your credentials.
  2. In the list of available services, click the Big Data SQL link to display the Big Data SQL details page.
  3. From the Status Summary section, click the Big Data SQL Query Server link to display the Big Data SQL Query Server details page.
  4. From the Actions drop-down list, select Restart this Big Data SQL Query Server.
    A dialog box is displayed. Click Restart this Big Data SQL Query Server. Another dialog box is displayed to monitor the status of the synchronization job.

4.5.2 Synchronizing Query Server Manually by Using Cloudera Manager

You can use Cloudera Manager or Apache Ambari to manually synchronize Query Server with the Hive databases that you specified.

After the synchronization, any changes in the Hive databases in the metastore, such as dropped or added tables, are reflected in Query Server. For example, you can synchronize Query Server in Cloudera Manager as follows:
  1. Log in to Cloudera Manager using your credentials.
  2. In the list of available services, click the Big Data SQL link to display the Big Data SQL details page.
  3. From the Status Summary section, click the Big Data SQL Query Server link to display the Big Data SQL Query Server details page.
  4. From the Actions drop-down list, select Synchronize Hive Databases.
    A dialog box is displayed. Click Synchronize Hive Databases. Another dialog box is displayed to monitor the status of the synchronization job.
You must specify the Hive databases to use in the synchronization either by using the bds-config.json configuration file or by using the sync_hive_db_list configuration parameter in Cloudera Manager.

4.5.3 Synchronizing Query Server Manually by Using the PL/SQL API

You can synchronize Query Server with the Hive databases that you specified by using the PL/SQL API.

To do so, invoke the dbms_bdsqs.sync_hive_databases PL/SQL API locally on the edge node where the Query Server is installed.

The procedure takes no parameters. It refreshes only the Hive databases listed in the sync_hive_db_list configuration parameter, that is, the databases that are already synchronized with Query Server. Each successive synchronization (also known as a refresh) processes only the changes made since the last Query Server metadata refresh.

A synchronization captures any tables that were added or dropped in the Hive metastore since the last refresh. This also includes any tables whose schemas might have changed.
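
A minimal sketch of invoking the procedure follows; it assumes a SQL*Plus session on the edge node that is connected to the BDSQLUSR pluggable database with privileges to execute the package:

-- Refresh the Query Server metadata from the Hive metastore.
-- Assumes a session in the BDSQLUSR pluggable database with execute
-- privileges on the DBMS_BDSQS package.
BEGIN
  dbms_bdsqs.sync_hive_databases;
END;
/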

4.5.4 Enabling Query Server Full Synchronization

You can specify whether Query Server performs a delta (default) or a full synchronization.

During the Query Server installation process, the Oracle schemas and the appropriate external tables are created based on the Hive databases list that you can specify either in the bds-config.json configuration file or the sync_hive_db_list configuration parameter. In that case, Query Server performs a full synchronization. By default, Query Server performs a delta synchronization during subsequent restarts or synchronizations.

You can control whether Query Server performs a full or a delta synchronization by using the Enable full synchronization configuration parameter in Cloudera Manager or Apache Ambari. This configuration parameter is de-selected by default. To enable Query Server to perform a full synchronization, select this checkbox in Cloudera Manager or Apache Ambari. For example, you can use Cloudera Manager to enable Query Server to perform a full synchronization during a restart or a manual synchronization as follows:

  1. Log in to Cloudera Manager using your credentials.
  2. In Cloudera Manager, use the Search field to find the Enable full synchronization configuration parameter. Enter Enable full synchronization (or part of the name until it is displayed in the list) in the Search field, and then press Enter.
  3. Click Big Data SQL: Enable full synchronization. The checkbox is de-selected by default, which indicates that Query Server will perform a delta synchronization.
  4. To enable full synchronization, select the checkbox, and then click Save Changes.

A full synchronization drops all of the existing schemas and external tables from Query Server, and then re-creates new schemas and new external tables based on the Hive databases list that you specified in the sync_hive_db_list configuration parameter.

By default, Query Server performs a delta synchronization between the Hive databases in the metastore that you specify and Query Server. Any changes in the Hive databases, such as dropped or added tables, are reflected in Query Server. However, when you start Query Server for the very first time, it creates Oracle schemas based on the Hive databases that you specify either in the bds-config.json configuration file or in the sync_hive_db_list configuration parameter in Cloudera Manager or Apache Ambari.

The first time Query Server synchronizes with Hive, the process is slower than subsequent refreshes because it imports all of the tables for the specified databases (configured in Cloudera Manager or Apache Ambari) in the Hadoop cluster. Subsequent refreshes are much faster because they process only the changes made to the Hive metadata, such as the addition of new tables. During a delta import, Query Server also gathers new statistics for tables that have been added or modified.
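
To see how current the gathered statistics are after a refresh, one option is to query the data dictionary; the HTDB0 schema name below is a hypothetical example:

-- Hypothetical check: when were statistics last gathered for the external
-- tables in the HTDB0 schema?
SELECT table_name, last_analyzed
FROM   all_tab_statistics
WHERE  owner = 'HTDB0'
ORDER  BY last_analyzed;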

4.6 Query Server Restarts and Metadata Persistence

You can refresh the Query Server metadata using Cloudera Manager, Apache Ambari, or a PL/SQL API.

The following key metadata is saved so that it can be restored after a Query Server restart:

  • Table statistics

    Gathering statistics can be an expensive operation. Table statistics are gathered automatically after each metadata synchronization. You can gather statistics again later by using the following PL/SQL package procedure and parameters:

    DBMS_STATS.GATHER_TABLE_STATS(
        ownname          => '<schema>',
        tabname          => '<table_name>',
        estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE);
    

    Note:

    Make sure that you use the estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE parameter setting.
  • Hive external tables that use the ORACLE_HIVE access driver.
  • HDFS external tables that use the ORACLE_HDFS access driver.
  • User-defined views.
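
For example, a user-defined view created over a synchronized external table survives a restart. The sketch below uses hypothetical schema, table, and column names:

-- Hypothetical example: a user-defined view over a synchronized external
-- table. The view definition is preserved across Query Server restarts.
CREATE OR REPLACE VIEW recent_web_logs AS
  SELECT *
  FROM   htdb0.web_logs
  WHERE  log_date >= DATE '2019-01-01';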

4.7 Query Server Security

You can connect to Query Server in single-user or multi-user mode.

Query Server users connect to a pluggable database called BDSQLUSR. There are two ways to connect to the database, depending on whether the cluster is secure (by means of Kerberos) or non-secure.

Connecting to the Query Server Database

To query data in the Hadoop cluster, users can connect to the Query Server if it is installed. During the Big Data SQL installation, the installer creates the BDSQLUSR Query Server database on the edge node that you specified, and installs everything else that you need on that edge node to query data in the Hadoop cluster.

Query Server users can connect to the Query Server database in single-user or multi-user mode.

Connecting to Query Server in a Single-User Mode

Query Server supports a single-user model for non-secured Hadoop clusters. With this model, all users connect to the BDSQLUSR Query Server database as user BDSQL with a password that the administrator chooses during the Query Server installation.

Queries run on the cluster with the permissions of the oracle user. This means that the oracle user must be authorized to access the underlying Hadoop data, by using Sentry privileges, HDFS authorizations, or both. For example, you can connect to the Query Server database using SQL*Plus as follows:

sqlplus BDSQL/<bdsql_password>@BDSQLUSR 

Note:

Replace <bdsql_password> in the above command with the actual BDSQL password that the administrator specified during the Oracle Big Data SQL installation.

Changing the BDSQL User Password

When Oracle Big Data SQL Query Server is installed on a non-secure cluster, you can change the password of the BDSQL user with ALTER USER as follows:
sqlplus / as sysdba
sql> alter session set container=bdsqlusr;
sql> alter user bdsql identified by "<new_password>" replace "<current_password>";

You can also change the password by first switching to the oracle operating system user:

# su - oracle
sqlplus / as sysdba
sql> alter session set container=bdsqlusr;
sql> alter user bdsql identified by "<new_password>";

Note:

Replace <new_password> with the new password. The new password must conform to the required Oracle secure password guidelines. See Choosing a Secure Password for information about choosing your new password.

Note that on a Kerberos-secured cluster, the user BDSQL is disabled.

Connecting to Query Server in a Multi-User Mode

Query Server supports multi-user mode on Kerberos-enabled Hadoop clusters. You use this mode when there are multiple externally identified user accounts corresponding to Kerberos principals. Connected users’ identities (Kerberos principals) are used for authorization on the Hadoop cluster when the impersonation_enabled parameter is set to true in the bds-config.json configuration file. If this parameter is set to false, then authorization on the Hadoop cluster is performed as the oracle user. Sentry is used for authorization on Cloudera clusters, and HDFS authorization is used on Hortonworks clusters. (Oracle Big Data SQL does not yet make use of Apache Ranger on Hortonworks HDP.)

Note:

See The Multi-User Authorization Model to learn how to use Hadoop Secure Impersonation to direct the oracle account to execute tasks on behalf of other designated users.

Hadoop queries executing on the cluster on behalf of a user appear to the Hadoop nodes as the authenticated user and have the corresponding permissions. The same is true for Hive queries and for statistics gathering operations. Before you can connect to Query Server, you must be authenticated with Kerberos using kinit.

When you install or reconfigure Big Data SQL Query Server on a secure cluster, Jaguar collects all principals from the Key Distribution Center (KDC) running on nodes where Hadoop DataNodes are also installed. For each principal, an externally identified user is created on Big Data SQL Query Server. This install-time behavior is controlled by the syncPrincipals parameter in the bds-config.json configuration file. You can also invoke this operation by running the following command (note that the spelling of the Jaguar operation differs from the parameter name):
jaguar sync_principals
You can also use the DBMS_BDSQS_ADMIN package, which contains procedures to add and drop Query Server users. These Query Server users are the same Kerberos principals that access your Hadoop cluster.
DBMS_BDSQS_ADMIN.ADD_KERBEROS_PRINCIPALS(principal_list varchar2, op_semantics varchar2 DEFAULT 'STOP_ON_FIRST_ERROR')

DBMS_BDSQS_ADMIN.DROP_KERBEROS_PRINCIPALS(principal_list varchar2, op_semantics varchar2 DEFAULT 'STOP_ON_FIRST_ERROR')  

Note:

Before you can run the procedures in the DBMS_BDSQS_ADMIN package, you must connect to Oracle Big Data SQL Query Server as user sys using OS authentication. For example, you can log in to SQL*Plus as the oracle OS user. See the Oracle Big Data SQL Reference for details.
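
The following is a minimal sketch of adding Kerberos principals with this package. The principal names are hypothetical, and the sketch assumes that principal_list accepts a comma-separated list of principals:

-- Hypothetical sketch: register two Kerberos principals as Query Server users.
-- Assumes a SYS connection via OS authentication, as described in the note above.
BEGIN
  DBMS_BDSQS_ADMIN.ADD_KERBEROS_PRINCIPALS(
    principal_list => 'user1@EXAMPLE.COM,user2@EXAMPLE.COM');
END;
/
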
Users in multi-user mode can then connect by using SQL*Plus without providing a password, as follows:
[user_name@cluster_name ~]$ kinit user_name 
Password for user_name@cluster_name.US.ORACLE.COM: 
[user_name@cluster_name ~]$ sqlplus /@BDSQLUSR

SQL*Plus: Release 18.0.0.0.0 - Production on Tue Oct 2 13:54:39 2018
Version 18.3.0.0.0

Copyright (c) 1982, 2018, Oracle.  All rights reserved.

Last Successful login time: Tue Oct 02 2018 13:54:20 -05:00

Connected to:
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production
Version 18.3.0.0.0

SQL>

Note:

In the above example, user_name and cluster_name reflect your actual user name and cluster name.