4 Working With Query Server
Learn how to manage Query Server with Oracle Big Data SQL.
4.1 About Oracle Big Data SQL Query Server
Oracle Big Data SQL Query Server is an Oracle Database instance that you can optionally install as a component of Oracle Big Data SQL on an edge node in your Hadoop cluster. You use Query Server primarily to query data stored in the cluster (in HDFS and Hive formats) through Oracle external tables. This enables you to take advantage of the full SQL capabilities provided by Oracle Database.
You can define external tables using the ORACLE_HDFS, ORACLE_HIVE, or ORACLE_BIGDATA access drivers, or have Query Server automatically define external tables based on the metadata in the Hive metastore. In the latter case, Hive databases map to Oracle Database schemas, and the corresponding Hive tables are defined as Oracle external tables in those schemas. All data authorization is based on authorization rules in Hadoop, such as Apache Sentry or HDFS Access Control Lists (ACLs).
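As a sketch of the first approach, an external table over a Hive table can be declared with the ORACLE_HIVE access driver. The table, column, directory, and Hive object names below are illustrative only; adapt them to your environment:

```sql
-- Hedged example: ORDER_ITEMS_EXT, DEFAULT_DIR, and htdb0.order_items
-- are hypothetical names.
CREATE TABLE order_items_ext (
  order_id   NUMBER,
  item_name  VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename=htdb0.order_items
  )
)
REJECT LIMIT UNLIMITED;
```

Once defined, the table can be queried with ordinary SQL, for example SELECT COUNT(*) FROM order_items_ext;.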
Once installed, Query Server provides an Oracle Database deployment that is automatically configured to query data in your Hadoop cluster using SQL. Restarting the Query Server restores the database to a “clean” state, eliminating management overhead. A restart preserves external tables (ORACLE_HIVE, ORACLE_HDFS, and ORACLE_BIGDATA types), associated statistics, user-defined views, and credentials. A restart deletes regular tables containing user data.
If your solution requires high availability (HA), advanced Oracle security policies, or combining data in Oracle Database with data in Hadoop, then you should install a full Oracle Database with Big Data SQL. Oracle supports using both Query Server and a Big Data SQL-enabled Oracle Database for a single Hadoop deployment.
To install Query Server, you must specify an existing edge node in your Hadoop cluster in the bds-config.json configuration file. You use the same configuration file to specify a list of Hive databases. Query Server automatically creates Oracle external tables corresponding to the tables in the Hive metastore database(s) so that they are ready for querying after a successful installation. The set of external tables in Query Server can be kept up to date with the corresponding Hive metastore tables by running either the Restart this Big Data SQL Query Server or the Synchronize Hive Databases command in the Cloudera Manager or Apache Ambari cluster management software. You can also use the dbms_bdsqs.sync_hive_databases PL/SQL API package procedure.
See Also:
See Introduction in the Oracle Big Data SQL Installation Guide, which describes how to install and configure the software on the two sides of an Oracle Big Data SQL configuration.
See Storing Oracle Tablespaces in HDFS for instructions on how to set up data files for smart scanning.
4.2 Important Terms and Concepts
Introduction to edge nodes, edge database, cell nodes, and Hadoop cluster integration.
These terms are key to understanding Query Server.
About Edge Nodes
An edge node in a Hadoop cluster is the interface between the Hadoop cluster and the outside network. Typically, edge nodes are used to run client applications and Hadoop cluster administration tools such as Cloudera Manager and Apache Ambari. Edge nodes can act as a data gateway, by providing HDFS access through NFS or HttpFS, or by running REST servers.
About Cell Nodes
Big Data SQL cells run on the Hadoop DataNodes and allow parts of query processing to be pushed down to the DataNodes where the data resides. This ensures both load distribution and a reduction in the volume of data that must be sent to the database for processing, which can result in significant performance improvements on big data workloads.
Hadoop Cluster Integration
Oracle Big Data SQL includes the following three service roles that you can manage in either Cloudera Manager or Apache Ambari:
- Big Data SQL Query Server: Enables you to run SQL queries against the Hadoop cluster. Applications connect to this server using JDBC or SQL*Net.
- Big Data SQL Agent: Manages the Big Data SQL installation and is also used by the Copy to Hadoop feature.
- Big Data SQL Server: Also known as Big Data SQL Cells, allows parts of query processing to be pushed down to the Hadoop cluster DataNodes where the data resides.
4.3 Query Server Features
Big Data SQL Query Server provides automatic installation and configuration, integration with Hadoop cluster managers, and automatic integration of cluster metadata:
- Automatic installation and configuration: The Oracle Big Data SQL installer automatically installs and configures Query Server if you specify an existing target edge node in the Hadoop cluster in the bds-config.json configuration file. To specify the edge node on which to install Query Server, add the edgedb parameter with the node and enabled attributes to the bds-config.json configuration file, as shown in the following example, where dbnode.domain.com is the name of your edge node:
"edgedb": { "node" : "dbnode.domain.com", "enabled" : "true", "sync_hive_db_list" : "my_hive_db_1,my_hive_db2" }
Note:
If the bds-config.json configuration file does not include the edgedb subsection, then Query Server is not installed.
See Also:
The bds-config.json Configuration Example in the installation guide shows a fully populated bds-config.json file. The example includes all available configuration parameters.
- Integration with Hadoop cluster managers: You can monitor and manage Query Server as a service using the Cloudera Manager or Apache Ambari Hadoop cluster management tools.
- Synchronization with Hive: When you start the Oracle Big Data SQL service, Query Server automatically refreshes its metadata from the Hive metastore. After the initial refresh, users can synchronize Query Server with the latest metadata in the Hive metastore.
4.4 Specifying the Hive Databases to Synchronize With Query Server
Before you can synchronize Query Server with the desired Hive databases in the metastore, you have to specify the list of Hive databases.
- During installation, specify the sync_hive_db_list parameter in the bds-config.json configuration file.
- After installation, you can update the sync_hive_db_list configuration parameter in Cloudera Manager or Apache Ambari.
After installation, Query Server automatically creates schemas and external tables based on the list of Hive metastore databases that you specified. Every subsequent Query Server restart performs a delta synchronization.
4.4.1 Specifying the Hive Databases in the bds-config.json Configuration File
You can provide the initial list of Hive databases to synchronize with Query Server as part of the installation process using the bds-config.json configuration file.
In the configuration file, include the sync_hive_db_list configuration parameter followed by a list of the Hive databases. The following example specifies two Hive databases for the sync_hive_db_list configuration parameter: htdb0 and htdb1. Only these two databases will be synchronized with Query Server, even if the Hive metastore contains other databases.
"edgedb": {
"node": "<edgenode_host_name>",
"enabled": "true",
"sync_hive_db_list": "htdb0,htdb1"
. . .
}
To synchronize all Hive databases in the metastore with Query Server, use the "*" wildcard character as follows:
"edgedb": {
"node": "EdgeNode_Host_Name",
"enabled": "true",
"sync_hive_db_list": "*"
. . .
}
If the bds-config.json configuration file does not contain the sync_hive_db_list configuration parameter, then no synchronization takes place between the Hive databases and Query Server. In that case, you must specify the Hive databases using the sync_hive_db_list configuration parameter in Cloudera Manager or Apache Ambari.
Note:
Query Server is not intended to store internal data in Oracle tables. Whenever Query Server is restarted, it is reset to its initial, clean state. This eliminates typical database maintenance tasks such as storage management and database configuration. The goal of Query Server is to provide a SQL front end for data in Hadoop, Object Store, Kafka, and NoSQL databases, not to serve as a general-purpose RDBMS.
4.4.2 Updating the Hive Databases With the sync_hive_db_list Configuration Parameter
You can update the list of Hive databases to synchronize with Query Server by editing the sync_hive_db_list configuration parameter in Cloudera Manager or Apache Ambari.
4.5 Synchronizing Query Server With Hive
You can synchronize Query Server with the Hive databases that you specified by using Cloudera Manager, Apache Ambari, or the dbms_bdsqs.sync_hive_databases PL/SQL API.
- Execute the Restart this Big Data SQL Query Server command in Cloudera Manager or Apache Ambari.
- Execute the Synchronize Hive Databases command in Cloudera Manager or Apache Ambari.
- Invoke the dbms_bdsqs.sync_hive_databases PL/SQL API locally on the edge node.
You specify the Hive databases in the bds-config.json configuration file or by using the sync_hive_db_list configuration parameter in Cloudera Manager. Note that the dbms_bdsqs.sync_hive_databases PL/SQL API only refreshes the Hive table definitions for Hive databases that have already been synchronized through the other two methods.
4.5.1 Restarting Query Server Manually by Using Cloudera Manager
You can synchronize Query Server with the Hive databases that you specified by restarting Query Server in Cloudera Manager or Apache Ambari.
You specify the Hive databases in the bds-config.json configuration file or by using the sync_hive_db_list configuration parameter in Cloudera Manager.
4.5.2 Synchronizing Query Server Manually by Using Cloudera Manager
You can use Cloudera Manager or Apache Ambari to manually synchronize Query Server with the Hive databases that you specified.
You specify the Hive databases in the bds-config.json configuration file or by using the sync_hive_db_list configuration parameter in Cloudera Manager.
4.5.3 Synchronizing Query Server Manually by Using the PL/SQL API
You can synchronize Query Server with the Hive databases that you specified by using the PL/SQL API.
To do so, invoke the dbms_bdsqs.sync_hive_databases PL/SQL API locally on the edge node where Query Server is installed.
The procedure takes no parameters. It synchronizes all of the Hive databases that are already in Query Server: the API refreshes Query Server with only the Hive databases listed in the sync_hive_db_list configuration parameter. Each successive synchronization (also known as a refresh) processes the changes made since the last Query Server metadata refresh.
A synchronization captures any tables that were added or dropped in the Hive metastore since the last refresh, including any tables whose schemas have changed.
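A minimal sketch of such an invocation from SQL*Plus on the edge node might look like the following. The connection method and container name assume a default installation; verify the required privileges for your environment:

```sql
-- Run locally on the Query Server edge node as the oracle OS user:
--   sqlplus / as sysdba
ALTER SESSION SET CONTAINER = bdsqlusr;  -- switch to the Query Server PDB
EXEC dbms_bdsqs.sync_hive_databases;     -- delta refresh of Hive metadata
```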
4.5.4 Enabling Query Server Full Synchronization
You can specify whether Query Server performs a delta (default) or a full synchronization.
During the Query Server installation process, the Oracle schemas and the appropriate external tables are created based on the Hive databases list that you specify either in the bds-config.json configuration file or in the sync_hive_db_list configuration parameter. In that case, Query Server performs a full synchronization. By default, Query Server performs a delta synchronization during subsequent restarts or synchronizations.
You can control whether Query Server performs a full or a delta synchronization by using the Enable full synchronization configuration parameter in Cloudera Manager or Apache Ambari. This configuration parameter is deselected by default. To enable Query Server to perform a full synchronization, select this checkbox in Cloudera Manager or Apache Ambari. For example, you can use Cloudera Manager to enable Query Server to perform a full synchronization during a restart or a manual synchronization as follows:
- Log in to Cloudera Manager using your login credentials.
- In Cloudera Manager, use the Search field to find the Enable full synchronization configuration parameter. Enter Enable full synchronization (or part of the name until it is displayed in the list) in the Search field, and then press Enter.
- Click Big Data SQL: Enable full synchronization. The checkbox is deselected by default, which indicates that Query Server performs a delta synchronization.
- To enable full synchronization, select the checkbox, and then click Save Changes.
A full synchronization drops all of the existing schemas and external tables from Query Server, and then re-creates new schemas and new external tables based on the Hive databases list that you specified in the sync_hive_db_list configuration parameter.
By default, Query Server performs a delta synchronization between the specified Hive databases in the metastore and Query Server: any changes in the Hive databases, such as dropping or adding tables, are reflected in Query Server. However, when you start Query Server for the very first time, it creates Oracle schemas based on the Hive databases that you specify either in the bds-config.json configuration file or in the sync_hive_db_list configuration parameter in Cloudera Manager or Apache Ambari.
The first time Query Server synchronizes with Hive, the process is slower than usual because it imports all of the tables for the specified databases (configured in Cloudera Manager or Apache Ambari) in the Hadoop cluster. Subsequent refreshes should be much faster, because they process only the changes made to the Hive metadata, such as the addition of new tables. During a delta import, Query Server also gathers new statistics for tables that were added or modified.
4.6 Query Server Restarts and Metadata Persistence
You can refresh the Query Server metadata using Cloudera Manager, Apache Ambari, or a PL/SQL API.
The following key metadata is preserved and restored across a Query Server restart:
- Table statistics
Gathering statistics can be an expensive operation. Table statistics are gathered automatically after each metadata synchronization. Subsequent statistics gathering can be captured using the following PL/SQL package procedure and parameters:
DBMS_STATS.GATHER_TABLE_STATS (ownname => '<schema>', tabname => '<table_name>', estimate_percent => dbms_stats.auto_sample_size);
Note:
Make sure you use the estimate_percent => dbms_stats.auto_sample_size parameter.
- Hive external tables that use the ORACLE_HIVE access driver.
- HDFS external tables that use the ORACLE_HDFS access driver.
- User-defined views.
4.7 Query Server Security
You can connect to Query Server in either single-user or multi-user mode.
Query Server users connect to a pluggable database called BDSQLUSR. There are two ways to connect to the database, depending on whether the cluster is secure (by means of Kerberos) or non-secure.
Connecting to the Query Server Database
To query data in the Hadoop cluster, users can connect to the Query Server if it is installed. During the Big Data SQL installation, the installer creates the BDSQLUSR Query Server database on the edge node that you specified, and installs everything else that you need on that node to query data in the Hadoop cluster.
Query Server users can connect to the Query Server database using the single-user or multi-user modes.
Connecting to Query Server in a Single-User Mode
Query Server supports a single-user model for non-secured Hadoop clusters. With this model, all users connect to the BDSQLUSR Query Server database as user BDSQL, with a password that the administrator chooses during the Query Server installation.
Queries run on the cluster with the oracle user's permissions. This means that the oracle user must be authorized to access the underlying Hadoop data, either through Sentry privileges or HDFS authorizations. For example, you can connect to the Query Server database using SQL*Plus as follows:
sqlplus BDSQL/<bdsql_password>@BDSQLUSR
Note:
Substitute <bdsql_password> in the above command with the actual BDSQL password that the administrator specified during the Oracle Big Data SQL installation.
Changing the BDSQL User Password
You can change the password for the BDSQL user with the ALTER USER statement, as follows:
sqlplus / as sysdba
sql> alter session set container=bdsqlusr;
sql> alter user bdsql identified by "<new_password>" replace "<current_password>";
You can also change the password while logged in as the oracle operating system user:
# su - oracle
sqlplus / as sysdba
sql> alter session set container=bdsqlusr;
sql> alter user bdsql identified by "<new_password>";
Note:
Substitute <new_password> with the new password. The new password must conform to the required Oracle secure password guidelines. See Choosing a Secure Password for information about choosing your new password.
Note that on a Kerberos-secured cluster, the BDSQL user is disabled.
Connecting to Query Server in a Multi-User Mode
Query Server supports Kerberos-enabled Hadoop clusters. You use this mode when there are multiple externally identified user accounts, corresponding to Kerberos principals. Connected users' identities (Kerberos principals) are used for authorization on the Hadoop cluster when the impersonation_enabled parameter is set to true in the bds-config.json configuration file. If this parameter is set to false, then authorization on the Hadoop cluster is performed as user oracle. Sentry is used on Cloudera clusters; HDFS authorization is used on Hortonworks clusters. (Oracle Big Data SQL does not yet make use of Apache Ranger on Hortonworks HDP.)
Note:
See The Multi-User Authorization Model to learn how to use Hadoop Secure Impersonation to direct the oracle account to execute tasks on behalf of other designated users.
Users authenticate to the cluster with Kerberos by using kinit. When you install or reconfigure Big Data SQL Query Server on a secure cluster, Jaguar collects all principals from the Key Distribution Center (KDC) running on nodes where Hadoop DataNodes are also installed. For each principal, an externally identified user is created on Big Data SQL Query Server. This install-time behavior is controlled by the syncPrincipals parameter in the bds-config.json configuration file. The same operation can also be invoked by running the following command (notice that the spelling of the Jaguar operation is different): jaguar sync_principals
Query Server also provides the DBMS_BDSQS_ADMIN package, which contains procedures to add and drop Query Server users. These Query Server users are the same Kerberos principals that will be accessing your Hadoop cluster:
DBMS_BDSQS_ADMIN.ADD_KERBEROS_PRINCIPALS(principal_list varchar2, op_semantics varchar2 DEFAULT 'STOP_ON_FIRST_ERROR')
DBMS_BDSQS_ADMIN.DROP_KERBEROS_PRINCIPALS(principal_list varchar2, op_semantics varchar2 DEFAULT 'STOP_ON_FIRST_ERROR')
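For example, a hedged sketch of adding two principals in one call; the principal names are hypothetical, and the call assumes you are connected as sys with OS authentication:

```sql
-- principal_list is a comma-separated list of Kerberos principals.
BEGIN
  DBMS_BDSQS_ADMIN.ADD_KERBEROS_PRINCIPALS('user1@MY.REALM.COM,user2@MY.REALM.COM');
END;
/
```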
Note:
Before you can run the procedures in the DBMS_BDSQS_ADMIN package, you must connect to Oracle Big Data SQL Query Server as user sys using OS authentication. For example, you can log in to SQL*Plus as OS user oracle. See Oracle Big Data SQL Reference for details.
The following example shows a Kerberos user obtaining a ticket with kinit and then connecting to the Query Server database:
[user_name@cluster_name ~]$ kinit user_name
Password for user_name@cluster_name.US.ORACLE.COM:
[user_name@cluster_name ~]$ sqlplus /@BDSQLUSR
SQL*Plus: Release 18.0.0.0.0 - Production on Tue Oct 2 13:54:39 2018
Version 18.3.0.0.0
Copyright (c) 1982, 2018, Oracle. All rights reserved.
Last Successful login time: Tue Oct 02 2018 13:54:20 -05:00
Connected to:
Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production
Version 18.3.0.0.0
SQL>
Note:
In the above example, user_name and cluster_name reflect your actual user name and cluster name.