3 Administering Oracle Big Data Appliance
This chapter provides information about the software and services installed on Oracle Big Data Appliance. It contains these sections:
3.1 Monitoring Multiple Clusters Using Oracle Enterprise Manager
An Oracle Enterprise Manager plug-in enables you to use the same system monitoring tool for Oracle Big Data Appliance as you use for Oracle Exadata Database Machine or any other Oracle Database installation. With the plug-in, you can view the status of the installed software components in tabular or graphic presentations, and start and stop these software services. You can also monitor the health of the network and the rack components.
Oracle Enterprise Manager enables you to monitor all Oracle Big Data Appliance racks on the same InfiniBand fabric. It provides summary views of both the rack hardware and the software layout of the logical clusters.
Note:
Before you start, contact Oracle Support for up-to-date information about Enterprise Manager plug-in functionality.
3.1.1 Using the Enterprise Manager Web Interface
After opening Oracle Enterprise Manager web interface, logging in, and selecting a target cluster, you can drill down into these primary areas:
-
InfiniBand network: Network topology and status for InfiniBand switches and ports. See Figure 3-1.
-
Hadoop cluster: Software services for HDFS, MapReduce, and ZooKeeper.
-
Oracle Big Data Appliance rack: Hardware status including server hosts, Oracle Integrated Lights Out Manager (Oracle ILOM) servers, power distribution units (PDUs), and the Ethernet switch.
The following figure shows a small section of the cluster home page.
Figure 3-1 YARN Page in Oracle Enterprise Manager
Description of "Figure 3-1 YARN Page in Oracle Enterprise Manager"
To monitor Oracle Big Data Appliance using Oracle Enterprise Manager:
-
Download and install the plug-in. See Oracle Enterprise Manager System Monitoring Plug-in Installation Guide for Oracle Big Data Appliance.
-
Log in to Oracle Enterprise Manager as a privileged user.
-
From the Targets menu, choose Big Data Appliance to view the Big Data page. You can see the overall status of the targets already discovered by Oracle Enterprise Manager.
-
Select a target cluster to view its detail pages.
-
Expand the target navigation tree to display the components. Information is available at all levels.
-
Select a component in the tree to display its home page.
-
To change the display, choose an item from the drop-down menu at the top left of the main display area.
3.1.2 Using the Enterprise Manager Command-Line Interface
3.2 Managing Operations Using Cloudera Manager
Cloudera Manager is installed on Oracle Big Data Appliance to help you with Cloudera's Distribution including Apache Hadoop (CDH) operations. Cloudera Manager provides a single administrative interface to all Oracle Big Data Appliance servers configured as part of the Hadoop cluster.
Cloudera Manager simplifies the performance of these administrative tasks:
-
Monitor jobs and services
-
Start and stop services
-
Manage security and Kerberos credentials
-
Monitor user activity
-
Monitor the health of the system
-
Monitor performance metrics
-
Track hardware use (disk, CPU, and RAM)
Cloudera Manager runs on the ResourceManager node (node03) and is available on port 7180.
To use Cloudera Manager:
-
Open a browser and enter a URL like the following:
http://bda1node03.example.com:7180
In this example, bda1 is the name of the appliance, node03 is the name of the server, example.com is the domain, and 7180 is the default port number for Cloudera Manager.
-
Log in with a user name and password for Cloudera Manager. Only a user with administrative privileges can change the settings. Other Cloudera Manager users can view the status of Oracle Big Data Appliance.
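If you prefer to check cluster and service status from a terminal, the Cloudera Manager REST API returns the same kind of information in JSON. The following is a minimal sketch, not part of the standard procedure: the host name matches the example above, while the credentials and the API version segment (v19) are assumptions that vary by site and by Cloudera Manager release.
# List the clusters known to Cloudera Manager (credentials and API version are assumptions).
curl -s -u admin:admin_password http://bda1node03.example.com:7180/api/v19/clusters
# List the services in a named cluster together with their health summaries.
curl -s -u admin:admin_password http://bda1node03.example.com:7180/api/v19/clusters/cluster1/services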
See Also:
https://www.cloudera.com/documentation/enterprise/latest/topics/cm_dg_about.html provides information on Cloudera monitoring and diagnostics.
3.2.1 Monitoring the Status of Oracle Big Data Appliance
In Cloudera Manager, you can choose any of the following pages from the menu bar across the top of the display:
-
Home: Provides a graphic overview of activities and links to all services controlled by Cloudera Manager. See the following figure.
-
Clusters: Accesses the services on multiple clusters.
-
Hosts: Monitors the health, disk usage, load, physical memory, swap space, and other statistics for all servers in the cluster.
-
Diagnostics: Accesses events and logs. Cloudera Manager collects historical information about the systems and services. You can search for a particular phrase for a selected server, service, and time period. You can also select the minimum severity level of the logged messages included in the search: TRACE, DEBUG, INFO, WARN, ERROR, or FATAL.
-
Audits: Displays the audit history log for a selected time range. You can filter the results by user name, service, or other criteria, and download the log as a CSV file.
-
Charts: Enables you to view metrics from the Cloudera Manager time-series data store in a variety of chart types, such as line and bar.
-
Backup: Accesses snapshot policies and scheduled replications.
-
Administration: Provides a variety of administrative options, including Settings, Alerts, Users, and Kerberos.
The following figure shows the Cloudera Manager home page.
3.2.2 Performing Administrative Tasks
As a Cloudera Manager administrator, you can change various properties for monitoring the health and use of Oracle Big Data Appliance, add users, and set up Kerberos security.
To access Cloudera Manager Administration:
-
Log in to Cloudera Manager with administrative privileges.
-
Click Administration, and select a task from the menu.
3.2.3 Managing CDH Services With Cloudera Manager
Cloudera Manager provides the interface for managing these services:
-
HDFS
-
Hive
-
Hue
-
Oozie
-
YARN
-
ZooKeeper
You can use Cloudera Manager to change the configuration of these services, and to stop and restart them. Additional services are also available, which require configuration before you can use them. See "Unconfigured Software."
Note:
Manual edits to Linux service scripts or Hadoop configuration files do not affect these services. You must manage and configure them using Cloudera Manager.
3.3 Using Hadoop Monitoring Utilities
You also have the option of using the native Hadoop utilities. These utilities are read-only and do not require authentication.
Cloudera Manager provides an easy way to obtain the correct URLs for these utilities. On the YARN service page, expand the Web UI submenu.
3.3.1 Monitoring MapReduce Jobs
You can monitor MapReduce jobs using the resource manager interface.
To monitor MapReduce jobs:
-
Open a browser and enter a URL like the following:
http://bda1node03.example.com:8088
In this example, bda1 is the name of the rack, node03 is the name of the server where the YARN resource manager runs, and 8088 is the default port number for the user interface.
The following figure shows the resource manager interface.
Figure 3-3 YARN Resource Manager Interface
Description of "Figure 3-3 YARN Resource Manager Interface"
3.3.2 Monitoring the Health of HDFS
You can monitor the health of the Hadoop file system by using the DFS health utility on the first two nodes of a cluster.
To monitor HDFS:
-
Open a browser and enter a URL like the following:
http://bda1node01.example.com:50070
In this example, bda1 is the name of the rack, node01 is the name of the server where the dfshealth utility runs, and 50070 is the default port number for the user interface.
Figure 3-3 shows the DFS health utility interface.
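The standard HDFS command-line utilities report much of the same health information. This is a minimal sketch; it assumes you run it on a cluster node as a user with HDFS superuser privileges (for example, hdfs).
# Report configured capacity, remaining space, and the status of each DataNode.
sudo -u hdfs hdfs dfsadmin -report
# Check the file system for missing, corrupt, or under-replicated blocks.
sudo -u hdfs hdfs fsck /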
3.4 Using Cloudera Hue to Interact With Hadoop
Hue runs in a browser and provides an easy-to-use interface to several applications to support interaction with Hadoop and HDFS. You can use Hue to perform any of the following tasks:
-
Query Hive data stores
-
Create, load, and delete Hive tables
-
Work with HDFS files and directories
-
Create, submit, and monitor MapReduce jobs
-
Create, edit, and submit workflows using the Oozie dashboard
-
Manage users and groups
Hue is automatically installed and configured on Oracle Big Data Appliance. It runs on port 8888 of the ResourceManager node. See the tables in About CDH Clusters for Hue’s location within different cluster configurations.
To use Hue:
-
Log in to Cloudera Manager and click the hue service on the Home page.
-
On the hue page under Quick Links, click Hue Web UI.
-
Bookmark the Hue URL, so that you can open Hue directly in your browser. The following URL is an example:
http://bda1node03.example.com:8888
-
Log in with your Hue credentials.
If Hue accounts have not been created yet, log into the default Hue administrator account by using the following credentials:
-
Username: admin
-
Password: cm-admin-password
where cm-admin-password is the password that was specified for the Cloudera Manager admin user when the cluster was activated. You can then create other user and administrator accounts.
3.5 About the Oracle Big Data Appliance Software
The following sections identify the software installed on Oracle Big Data Appliance.
This section contains the following topics:
3.5.1 Unconfigured Software
Your Oracle Big Data Appliance license includes all components in Cloudera Enterprise Data Hub Edition. All CDH components are installed automatically by the Mammoth utility. Do not download them from the Cloudera website.
However, you must use Cloudera Manager to add some services before you can use them, such as the following:
To add a service:
-
Log in to Cloudera Manager as the
admin
user. -
On the Home page, expand the cluster menu in the left panel and choose Add a Service to open the Add Service wizard. The first page lists the services you can add.
-
Follow the steps of the wizard.
See Also:
-
For a list of key CDH components:
http://www.cloudera.com/content/www/en-us/products/apache-hadoop/key-cdh-components.html
3.5.2 Allocating Resources Among Services
You can allocate resources to each service—HDFS, YARN, Hive, and so forth—as a percentage of the total resource pool. Cloudera Manager automatically calculates the recommended resource management settings based on these percentages. The static service pools isolate services on the cluster, so that a high load on one service has a limited impact on the other services.
To allocate resources among services:
-
Log in as
admin
to Cloudera Manager. -
Open the Clusters menu at the top of the page, then select Static Service Pools under Resource Management.
-
Select Configuration.
-
Follow the steps of the wizard, or click Change Settings Directly to edit the current settings.
3.6 About CDH Clusters
There are slight variations in the location of the services within a cluster, depending on the configuration of the cluster.
In general, decommissioning or removing roles that were deployed by the Mammoth installer is not supported. In some cases, removal may be acceptable for slave roles, but the role must be completely removed from Cloudera Manager. Removing a role may also result in lower performance or reduced storage.
This section contains the following topics:
3.6.1 Services on a Three-Node Development Cluster
Oracle Big Data Appliance enables the use of three-node clusters for development purposes.
Caution:
Three-node clusters are generally not suitable for production environments because all of the nodes are master nodes. This puts constraints on high availability. The minimum recommended cluster size for a production environment is five nodes.
Table 3-1 Service Locations for a Three-Node Development Cluster
Node1 | Node2 | Node3 |
---|---|---|
NameNode | NameNode/Failover | - |
Failover Controller | Failover Controller | - |
DataNode | DataNode | DataNode |
NodeManager | NodeManager | NodeManager |
JournalNode | JournalNode | JournalNode |
- | HttpFS | Cloudera Manager and CM roles |
- | MySQL Backup | MySQL Primary |
ResourceManager | - | ResourceManager |
- | - | JobHistory |
- | ODI | Spark History |
- | Oozie | - |
Hue Server | Hue Server | - |
Hue Load Balancer | Hue Load Balancer | - |
ZooKeeper | ZooKeeper | ZooKeeper |
Active Navigator Key Trustee Server (if HDFS Transparent Encryption is enabled) | Passive Navigator Key Trustee Server (if HDFS Transparent Encryption is enabled) | - |
Kerberos Master KDC (Only if MIT Kerberos is enabled and on-BDA KDCs are being used.) | Kerberos Slave KDC (Only if MIT Kerberos is enabled and on-BDA KDCs are being used.) | - |
Sentry Server (if enabled) | Sentry Server (if enabled) | - |
Hive Metastore | Hive Metastore | - |
- | WebHCat | - |
3.6.2 Service Locations on Rack 1 of a CDH Cluster with Four or More Nodes
As of Release 5.1, the distribution of services within a multirack cluster has changed for new installations of Oracle Big Data Appliance. All four master nodes (and the services they host) are now located in the first rack of a cluster. In earlier releases, some critical services were hosted on the second rack of a multirack cluster.
Note:
Clusters that span multiple racks and are upgraded to Release 5.1 from older versions retain their existing multirack layout, in which some critical services are hosted on the second rack.
The table below identifies the services on the first rack of a CDH cluster. Node1 is the first server in the cluster and nodenn is the last server in the cluster. This service layout is the same for a single-rack cluster and the first rack of a multirack cluster.
Table 3-2 Service Locations in the First Rack of a Cluster
Node1 | Node2 | Node3 | Node4 | Node5 to nn |
---|---|---|---|---|
Balancer | - | Cloudera Manager Server | - | - |
Cloudera Manager Agent | Cloudera Manager Agent | Cloudera Manager Agent | Cloudera Manager Agent | Cloudera Manager Agent |
DataNode | DataNode | DataNode | DataNode | DataNode |
Failover Controller | Failover Controller | Big Data Manager (including BDM-proxy and BDM-notebook) | Oozie | - |
JournalNode | JournalNode | JournalNode | - | - |
- | MySQL Backup | MySQL Primary | - | - |
NameNode | NameNode | Navigator Audit Server and Navigator Metadata Server | - | - |
NodeManager (in clusters of eight nodes or less) | NodeManager (in clusters of eight nodes or less) | NodeManager | NodeManager | NodeManager |
Sentry Server (if enabled) | Sentry Server (if enabled) | SparkHistoryServer | Oracle Data Integrator Agent | - |
Hive Metastore | HttpFS | - | Hive Metastore and HiveServer2 | - |
ZooKeeper | ZooKeeper | ZooKeeper | Hive WebHCat Server | - |
Hue Server | - | JobHistory | Hue Server | - |
Hue Load Balancer | - | Hue Load Balancer | - | - |
Active Navigator Key Trustee Server (if HDFS Transparent Encryption is enabled) | Passive Navigator Key Trustee Server (if HDFS Transparent Encryption is enabled) | ResourceManager | ResourceManager | - |
Kerberos KDC (if MIT Kerberos is enabled and on-BDA KDCs are being used) | Kerberos KDC (if MIT Kerberos is enabled and on-BDA KDCs are being used) | - | - | - |
If Kerberos is enabled: Hue Kerberos Ticket Renewer, Key Trustee KMS Key Management Server Proxy, Key Trustee Server Active Database | If Kerberos is enabled: Key Trustee KMS Key Management Server Proxy, Key Trustee Server Passive Database | - | If Kerberos is enabled: Hue Kerberos Ticket Renewer | - |
Note:
If Oozie high availability is enabled, then Oozie servers are hosted on Node4 and another node (preferably a ResourceNode) selected by the customer.
3.6.3 Service Locations on Additional Racks of a Cluster
Note:
This layout has changed from previous releases. All critical services now run on the first rack of the cluster.
There is one variant that is determined by cluster size: in clusters of eight nodes or less, nodes that run the NameNode also run the NodeManager. This is not true for clusters larger than eight nodes.
The services running on all nodes of rack 2 and additional racks are the same as those running on node 5 and above on rack 1:
- Cloudera Manager Agent
- DataNode
- NodeManager (if cluster includes eight nodes or less)
3.6.4 About MapReduce
Yet Another Resource Negotiator (YARN) is the version of MapReduce that runs on Oracle Big Data Appliance. MapReduce applications developed using MapReduce 1 (MRv1) may require recompilation to run under YARN.
The ResourceManager performs all resource management tasks. An MRAppMaster performs the job management tasks. Each job has its own MRAppMaster. The NodeManager has containers that can run a map task, a reduce task, or an MRAppMaster. The NodeManager can dynamically allocate containers using the available memory. This architecture results in improved scalability and better use of the cluster than MRv1.
YARN also manages resources for Spark and Impala.
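As a quick check that YARN is accepting and scheduling work, you can submit one of the example MapReduce jobs that ship with CDH. This sketch is illustrative only; the parcel path to the examples JAR is an assumption and may differ on your installation.
# Submit the bundled pi-estimation example to YARN (the JAR path is an assumption).
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100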
See Also:
"Running Existing Applications on Hadoop 2 YARN" at
http://hortonworks.com/blog/running-existing-applications-on-hadoop-2-yarn/
3.6.5 Automatic Failover of the NameNode
The NameNode is the most critical process because it keeps track of the location of all data. Without a healthy NameNode, the entire cluster fails. Apache Hadoop v0.20.2 and earlier are vulnerable to failure because they have a single name node.
The current version of Cloudera's Distribution including Apache Hadoop in Oracle Big Data Appliance reduces this vulnerability by maintaining redundant NameNodes. The data is replicated during normal operation as follows:
-
CDH maintains redundant NameNodes on the first two nodes of a cluster. One of the NameNodes is in active mode, and the other NameNode is in hot standby mode. If the active NameNode fails, then the role of active NameNode automatically fails over to the standby NameNode.
-
The NameNode data is written to a mirrored partition so that the loss of a single disk can be tolerated. This mirroring is done at the factory as part of the operating system installation.
-
The active NameNode records all changes to the file system metadata in at least two JournalNode processes, which the standby NameNode reads. There are three JournalNodes, which run on the first three nodes of each cluster.
-
The changes recorded in the journals are periodically consolidated into a single fsimage file in a process called checkpointing.
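To see which NameNode currently holds the active role, you can query the HA state from the command line. In the sketch below, the NameNode service IDs (nn1, nn2) are assumptions; the actual IDs depend on how the nameservice is configured on your cluster.
# Report whether each NameNode is active or standby (service IDs are cluster-specific).
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2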
On Oracle Big Data Appliance, the default log level of the NameNode is DEBUG, to support the Oracle Audit Vault and Database Firewall plugin. If this option is not configured, then you can reset the log level to INFO.
Note:
Oracle Big Data Appliance 2.0 and later releases do not support the use of an external NFS filer for backups and do not use NameNode federation.
The following figure shows the relationships among the processes that support automatic failover of the NameNode.
Figure 3-6 Automatic Failover of the NameNode on Oracle Big Data Appliance
Description of "Figure 3-6 Automatic Failover of the NameNode on Oracle Big Data Appliance"
3.6.6 Automatic Failover of the ResourceManager
The ResourceManager allocates resources for application tasks and application masters across the cluster. Like the NameNode, the ResourceManager is a critical point of failure for the cluster. If all ResourceManagers fail, then all jobs stop running. Oracle Big Data Appliance supports ResourceManager High Availability in Cloudera 5 to reduce this vulnerability.
CDH maintains redundant ResourceManager services on node03 and node04. One of the services is in active mode, and the other service is in hot standby mode. If the active service fails, then the role of active ResourceManager automatically fails over to the standby service. No failover controllers are required.
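You can confirm which ResourceManager is currently active from the command line. In this sketch, the ResourceManager IDs (rm1, rm2) are assumptions that depend on your cluster configuration.
# Report whether each ResourceManager is active or standby (IDs are cluster-specific).
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2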
The following figure shows the relationships among the processes that support automatic failover of the ResourceManager.
Figure 3-7 Automatic Failover of the ResourceManager on Oracle Big Data Appliance
Description of "Figure 3-7 Automatic Failover of the ResourceManager on Oracle Big Data Appliance"
3.6.7 Map and Reduce Resource Allocation
Oracle Big Data Appliance dynamically allocates memory to YARN. The allocation depends upon the total memory on the node and whether the node is one of the four critical nodes.
If you add memory, update the NodeManager container memory by increasing it by 80% of the memory added. Leave the remaining 20% for overhead.
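For example, under this guideline, adding 64 GB of physical memory to a node works out as follows (illustrative arithmetic only):
Added memory: 64 GB = 65,536 MB
Increase for NodeManager containers: 0.80 × 65,536 MB ≈ 52,429 MB
Reserved for overhead: 0.20 × 65,536 MB ≈ 13,107 MB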
3.7 About Oracle NoSQL Database Clusters
Oracle NoSQL Database clusters do not have critical nodes, and because the storage nodes are replicated by a factor of three, the risk of critical failure is minimal. Administrative services are distributed among the nodes in a number equal to the replication factor. You can use the Administration CLI and Admin console to administer the cluster from any node that hosts the administrative processes.
If the node hosting Mammoth fails (the first node of the cluster), then follow the procedure for reinstalling it in "Prerequisites for Managing a Failing Node".
To repair or replace any failing Oracle NoSQL node, follow the procedure in "Managing a Failing Noncritical Node".
3.8 Effects of Hardware on Software Availability
The effects of a server failure vary depending on the server's function within the CDH cluster. Oracle Big Data Appliance servers are more robust than commodity hardware, so you should experience fewer hardware failures. This section highlights the most important services that run on the various servers of the primary rack. For a full list, see "Service Locations on Rack 1 of a CDH Cluster with Four or More Nodes."
Note:
In a multirack cluster, some critical services run on the first server of the second rack. See "Service Locations on Additional Racks of a Cluster."
3.8.1 Logical Disk Layout
The layout of logical disk partitions for X8, X7, X6, X5, and X4 server models is shown below.
On all drive configurations, the operating system is installed on disks 1 and 2. These two disks are mirrored. They include the Linux operating system, all installed software, NameNode data, and MySQL Database data. The NameNode and MySQL Database data are replicated on the two servers for a total of four copies.
- In X7, there is a switchover to the use of EFI instead of BIOS, and the USB drive is replaced with two SSDs.
- In Oracle Linux 7, swap partitioning is dropped.
- In X8, a new partition is added.
In the table below, note that Linux disk and partition device names (such as /dev/sdc or /dev/sdc1) are not stable. They are picked by the kernel at boot time, so they may change as disks are removed and re-added.
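Because the kernel-assigned names can change, it is safer to identify a drive by a persistent attribute when you work on it. The following sketch shows two generic Linux commands for doing so; it is illustrative and not specific to Oracle Big Data Appliance.
# List block devices with their sizes and serial numbers, which persist across reboots.
lsblk -o NAME,SIZE,SERIAL,MOUNTPOINT
# List persistent device links that encode the drive model and serial number.
ls -l /dev/disk/by-id/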
Table 3-3 Oracle Big Data Appliance Server Disk Partitioning
Disks 1 and 2 (OS) | Disks 3 – 12 (Data) |
---|---|
14 TB Drives (Big Data Appliance X8) | 14 TB Drives (Big Data Appliance X8) |
10 TB Drives (Big Data Appliance X7): Disk /dev/sdc, Disk /dev/sdd | 10 TB Drives (Big Data Appliance X7) |
8 TB Drives (Big Data Appliance X6 and X5) | 8 TB Drives (Big Data Appliance X6 and X5) |
4 TB Drives (Big Data Appliance X4) | 4 TB Drives (Big Data Appliance X4) |
3.8.2 Critical and Noncritical CDH Nodes
Critical nodes are required for the cluster to operate normally and provide all services to users. In contrast, the cluster continues to operate with no loss of service when a noncritical node fails.
Critical services are installed initially on the first four nodes of the cluster (within the first rack in the case of multirack clusters). The remaining nodes (node05 up to node18) run only noncritical services. If a hardware failure occurs on one of the critical nodes, then the services can be moved to another, noncritical server. For example, if node02 fails, then you might move its critical services to node05. Table 3-2 identifies the location of services on the first rack.
3.8.2.1 High Availability or Single Points of Failure?
Some services have high availability and automatic failover. Other services have a single point of failure. The following list summarizes the critical services:
-
NameNodes: High availability with automatic failover
-
ResourceManagers: High availability with automatic failover
-
MySQL Database: Primary and backup databases are configured with replication of the primary database to the backup database. There is no automatic failover. If the primary database fails, the functionality of the cluster is diminished, but no data is lost.
-
Cloudera Manager: The Cloudera Manager server runs on one node. If it fails, then Cloudera Manager functionality is unavailable.
-
Hue server, Hue load balancer, Sentry, Hive metastore: High availability
-
Oozie server, Oracle Data Integrator agent: These services have no redundancy. If the node fails, then the services are unavailable.
3.8.2.2 Where Do the Critical Services Run?
Critical services are hosted as shown below. See Service Locations on Rack 1 of a CDH Cluster with Four or More Nodes for the complete list of services on each node.
Table 3-4 Critical Service Locations on a Single Rack
Node Name | Critical Functions |
---|---|
First NameNode |
Balancer, Failover Controller, JournalNode, NameNode, Puppet Master, ZooKeeper, Hue Server. |
Second NameNode |
Failover Controller, JournalNode, MySQL Backup Database, NameNode, ZooKeeper |
First ResourceManager Node |
Cloudera Manager Server, JobHistory, JournalNode, MySQL Primary Database, ResourceManager, ZooKeeper. |
Second ResourceManager Node |
Hive, Hue, Oozie, Solr, Oracle Data Integrator Agent, ResourceManager |
3.8.3 First NameNode Node
If the first NameNode fails or goes offline (such as a restart), then the second NameNode automatically takes over to maintain the normal activities of the cluster.
Alternatively, if the second NameNode is already active, it continues without a backup. With only one NameNode, the cluster is vulnerable to failure. The cluster has lost the redundancy needed for automatic failover.
The puppet master also runs on this node. The Mammoth utility uses Puppet, so while this node is down you cannot install or reinstall software, for example, if you must replace a disk drive elsewhere in the rack.
3.8.4 Second NameNode Node
If the second NameNode fails, then the function of the NameNode either fails over to the first NameNode (node01) or continues there without a backup. However, the cluster has lost the redundancy needed for automatic failover if the first NameNode also fails.
The MySQL backup database also runs on this node. MySQL Database continues to run, although there is no backup of the master database.
3.8.5 First ResourceManager Node
If the first ResourceManager node fails or goes offline (such as in a restart of the server where the node is running), then the second ResourceManager automatically takes over the distribution of MapReduce tasks to specific nodes across the cluster.
If the second ResourceManager is already active when the first ResourceManager becomes inaccessible, then it continues as ResourceManager, but without a backup. With only one ResourceManager, the cluster is vulnerable because it has lost the redundancy needed for automatic failover.
These services are also disrupted:
-
Cloudera Manager: This tool provides central management for the entire CDH cluster. Without this tool, you can still monitor activities using the utilities described in "Using Hadoop Monitoring Utilities".
-
MySQL Database: Cloudera Manager, Oracle Data Integrator, Hive, and Oozie use MySQL Database. The data is replicated automatically, but you cannot access it when the master database server is down.
3.8.6 Second ResourceManager Node
If the second ResourceManager node fails, then the function of the ResourceManager either fails over to the first ResourceManager or continues there without a backup. However, the cluster has lost the redundancy needed for automatic failover if the first ResourceManager also fails.
These services are also disrupted:
-
Oracle Data Integrator Agent: This service supports Oracle Data Integrator, which is one of the Oracle Big Data Connectors. You cannot use Oracle Data Integrator when the ResourceManager node is down.
-
Hive: Hive provides a SQL-like interface to data that is stored in HDFS. Most of the Oracle Big Data Connectors can access Hive tables, which are not available if this node fails.
-
Hue: This administrative tool is not available when the ResourceManager node is down.
-
Oozie: This workflow and coordination service runs on the ResourceManager node, and is unavailable when the node is down.
3.8.7 Noncritical CDH Nodes
The noncritical nodes are optional in that Oracle Big Data Appliance continues to operate with no loss of service if a failure occurs. The NameNode automatically replicates the lost data to always maintain three copies. MapReduce jobs execute on copies of the data stored elsewhere in the cluster. The only loss is in computational power, because there are fewer servers on which to distribute the work.
3.9 Managing a Hardware Failure
If a server starts failing, you must take steps to maintain the services of the cluster with as little interruption as possible. You can manage a failing server easily using the bdacli
utility, as described in the following procedures. One of the management steps is called decommissioning. Decommissioning stops all roles for all services, thereby preventing data loss. Cloudera Manager requires that you decommission a CDH node before retiring it.
When a noncritical node fails, there is no loss of service. However, when a critical node fails in a CDH cluster, services with a single point of failure are unavailable, as described in "Effects of Hardware on Software Availability". You must decide between these alternatives:
-
Wait for repairs to be made, and endure the loss of service until they are complete.
-
Move the critical services to another node. This choice may require that some clients are reconfigured with the address of the new node. For example, if the second ResourceManager node (typically node03) fails, then users must redirect their browsers to the new node to access Cloudera Manager.
You must weigh the loss of services against the inconvenience of reconfiguring the clients.
3.9.1 Prerequisites for Managing a Failing Node
Ensure that you do the following before managing a failing or failed server, whether it is configured as a CDH node or an Oracle NoSQL Database node:
-
Try restarting the services or rebooting the server.
-
Determine whether the failing node is critical or noncritical.
-
If the failing node is where Mammoth is installed:
-
For a CDH node, select a noncritical node in the same cluster as the failing node.
For a NoSQL node, repair or replace the failed server first, and use it for these steps.
-
Upload the Mammoth bundle to that node and unzip it.
-
Extract all files from BDAMammoth-version.run, using a command like the following:
# ./BDAMammoth-ol6-4.0.0.run
Afterward, you must run all Mammoth operations from this node.
See Oracle Big Data Appliance Owner's Guide for information about the Mammoth utility.
-
Follow the appropriate procedure in this section for managing a failing node.
Mammoth is installed on the first node of the cluster, unless its services were migrated previously.
3.9.2 Managing a Failing CDH Critical Node
The procedure for dealing with a failed critical node is to migrate the critical services to another node.
After migrating the critical services from the failing node to another node, you have several options for reintegrating the failed node into the cluster:
- Reprovision the node
This procedure reinstalls all of the software required for the node to operate as a DataNode. Reprovisioning is the only option for a node that is not repairable and has been replaced.
- Recommission the node
If you can repair a failed critical node to the extent that the DataNode role is working, and you are sure that any problems that may interfere with the DataNode role are resolved, you may be able to save time by recommissioning the node instead of reprovisioning it. Recommissioning is a much faster process. It reintegrates the node as a DataNode, but does not perform a full reinstallation, and there is no need to reimage. It reverses the decommissioning of the node: a decommission puts the node in quarantine, and a recommission takes it out of quarantine.
Note:
A special procedure is required before you can recommission a failed Node1. See Preliminary Steps for Recommissioning Node1 at the end of this section.
To manage a failing critical node:
-
Log in as
root
to the “Mammoth node.” (The node where Mammoth is installed. This is usually Node1.) -
Migrate the services to a noncritical node. (Replace node_name below with the name of the failing node.)
bdacli admin_cluster migrate node_name
When the command finishes, the failing node is decommissioned and its services are now running on a previously noncritical node.
-
You may want to communicate the change to your user community so that they can redirect their clients to the new critical node as required.
-
Repair or replace the failed server.
- As
root
on the Mammoth node, either reprovision or recommission the repaired or replaced server as a noncritical node. Use the same name as the migrated node for node_name, such as "bda1node02".
- To reprovision the node:
Note:
If you intend to reprovision the node, it is recommended (though not required) that you reimage it first to ensure that there are no other problems with the software.
# bdacli admin_cluster reprovision <node_name>
- To recommission the node:
# bdacli admin_cluster recommission <node_name>
-
If the failed node supported services like HBase or Impala, which Mammoth installs but does not configure, then use Cloudera Manager to reconfigure them on the new node.
Preliminary Steps for Recommissioning Node1 Only
Before recommissioning a failed (and repaired) Node1, do the following:
- Determine where to relocate the Mammoth role. Mammoth ordinarily runs on Node1 and so when this node fails, the Mammoth role must be transferred to another node. It is best to avoid using other critical nodes and instead choose the first available DataNode.
- On each node of the cluster:
- Update /opt/oracle/bda/install/state/config.json. Change MAMMOTH_NODE to point to the node where you plan to host the Mammoth role.
- Update /opt/oracle/bda/cluster-hosts-infiniband to add Node1.
- On the node where you plan to host the Mammoth role:
# setup-root-ssh -C
- Use SSH to log on to each node as root and edit /opt/oracle/bda/install/state/config.json. Remove Node1 from the QUARANTINED arrays – QUARANTINED_POSNS and QUARANTINED_HOSTS.
- On the node where you plan to host the Mammoth role:
- Run mammoth -z. This node is now the new Mammoth node.
- Log on to each node in the cluster and, in /opt/oracle/bda/install/state/config.json, re-enter Node1 into the QUARANTINED arrays.
- You can now recommission Node1. On the Mammoth node, run:
# bdacli admin_cluster recommission node01
3.9.3 Managing a Failing Noncritical Node
Use the following procedure to replace a failing node in either a CDH or a NoSQL cluster.
To manage a failing noncritical node:
-
Log in as
root
to the node where Mammoth is installed (typically Node1). -
Decommission the failing node. Replace node_name with the name of the failing node.
bdacli admin_cluster decommission node_name
In Cloudera Manager, verify that the node is decommissioned.
- After decommissioning the failed node, the next steps depend upon which of these two conditions is true:
- The node can be repaired without reimaging.
- The node must be replaced, or the node can be repaired but requires reimaging.
If a failed node cannot be accessed or if the OS is corrupted, it must be reimaged.
- If the node is a replacement or must be reimaged, then follow the instructions in Document 1485745.1 in My Oracle Support. This document provides links to imaging instructions for all Oracle Big Data Appliance releases.
After reimaging, reprovision the node:
# bdacli admin_cluster reprovision node_name
After this, the node is ready for recommissioning.
- If the existing node can be repaired without reimaging, then recommission it. There is no need for reprovisioning.
To recommission the node in either case, log in to the Mammoth node as root and run the following bdacli command. Use the same name as the decommissioned node for node_name:
bdacli admin_cluster recommission node_name
-
If the node is part of a CDH cluster, log into Cloudera Manager, and locate the recommissioned node. Check that HDFS DataNode, YARN NodeManager, and any other roles that should be running are showing a green status light. If they are not, then manually restart them.
See Also:
Oracle Big Data Appliance Owner's Guide for the complete bdacli
syntax
3.10 Stopping and Starting Oracle Big Data Appliance
This section describes how to shut down Oracle Big Data Appliance gracefully and restart it.
3.10.1 Prerequisites
You must have root
access. Passwordless SSH must be set up on the cluster, so that you can use the dcli
utility.
To ensure that passwordless SSH is set up:
-
Log in to the first node of the cluster as
root
. -
Use a
dcli
command to verify it is working. This command should return the IP address and host name of every node in the cluster:
# dcli -C hostname
192.0.2.1: bda1node01.example.com
192.0.2.2: bda1node02.example.com
. . .
-
If you do not get these results, then set up
dcli
on the cluster:
# setup-root-ssh -C
See Also:
Oracle Big Data Appliance Owner's Guide for details about these commands.
3.10.2 Stopping Oracle Big Data Appliance
Follow these procedures to shut down all Oracle Big Data Appliance software and hardware components.
Note:
The following services stop automatically when the system shuts down. You do not need to take any action:
-
Oracle Enterprise Manager agent
-
Auto Service Request agents
3.10.2.1 Stopping All Managed Services
Use Cloudera Manager to stop the services it manages, including flume
, hbase
, hdfs
, hive
, hue
, mapreduce
, oozie
, and zookeeper
.
3.10.2.2 Stopping Cloudera Manager Server
Follow this procedure to stop Cloudera Manager Server.
After stopping Cloudera Manager, you cannot access it using the web console.
3.10.2.4 Dismounting NFS Directories
All nodes share an NFS directory on node03, and additional directories may also exist. If a server with the NFS directory (/opt/exportdir
) is unavailable, then the other servers hang when attempting to shut down. Thus, you must dismount the NFS directories first.
3.10.2.5 Stopping the Servers
The Linux shutdown -h
command powers down individual servers. You can use the dcli -g
command to stop multiple servers.
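For example, a sketch of powering down several servers at once is shown below; the group file name is hypothetical, it lists the host names of the servers to stop, and you would run the command from the node where dcli is configured.
# Power off every server listed in the group file (the file name here is an assumption).
dcli -g server_group shutdown -h -P now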
3.10.2.6 Stopping the InfiniBand and Cisco Switches
The network switches do not have power buttons; they shut down only when power is removed. To stop the switches, turn off all breakers in the two PDUs.
3.10.3 Starting Oracle Big Data Appliance
Follow these procedures to power up the hardware and start all services on Oracle Big Data Appliance.
3.10.3.1 Powering Up Oracle Big Data Appliance
- Switch on all 12 breakers on both PDUs.
- Allow 4 to 5 minutes for Oracle ILOM and the Linux operating system to start on the servers.
If the servers do not start automatically, then you can start them locally by pressing the power button on the front of the servers, or remotely by using Oracle ILOM. Oracle ILOM has several interfaces, including a command-line interface (CLI) and a web console. Use whichever interface you prefer.
For example, you can log in to the web interface as root
and start the server from the Remote Power Control page. The URL for Oracle ILOM is the same as for the host, except that it typically has a -c or -ilom extension. This URL connects to Oracle ILOM for bda1node04:
http://bda1node04-ilom.example.com
3.10.3.2 Starting the HDFS Software Services
Use Cloudera Manager to start all the HDFS services that it controls.
3.11 Auditing Oracle Big Data Appliance
Notice:
Audit Vault and Database Firewall is no longer supported for use with Oracle Big Data Appliance. It is recommended that customers use Cloudera Navigator for monitoring.
3.12 Collecting Diagnostic Information for Oracle Customer Support
If you need help from Oracle Support to troubleshoot CDH issues, then you should first collect diagnostic information using the bdadiag
utility with the cm
option.
To collect diagnostic information:
-
Log in to an Oracle Big Data Appliance server as
root
. -
Run
bdadiag
with at least the cm
option. You can include additional options on the command line as appropriate. See the Oracle Big Data Appliance Owner's Guide for a complete description of the bdadiag
syntax.
# bdadiag cm
The command output identifies the name and the location of the diagnostic file.
-
Go to My Oracle Support at
http://support.oracle.com
. -
Open a Service Request (SR) if you have not already done so.
-
Upload the
bz2
file into the SR. If the file is too large, then upload it to sftp.oracle.com
, as described in the next procedure.
To upload the diagnostics to sftp.oracle.com:
-
Open an SFTP client and connect to
sftp.oracle.com
. Specify port 2021 and remote directory /support/incoming/target, where target
Log in with your Oracle Single Sign-on account and password.
-
Upload the diagnostic file to the new directory.
-
Update the SR with the full path and the file name.
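A sketch of the upload session from a command-line SFTP client follows; the user name, target folder, and file name are placeholders for the values that apply to your account and service request.
# Connect to the Oracle SFTP server on port 2021 and sign in with your Oracle Single Sign-on account.
sftp -P 2021 your_sso_user@sftp.oracle.com
# At the sftp> prompt, change to the directory given by Oracle Support and upload the diagnostic file.
# sftp> cd /support/incoming/target
# sftp> put diagnostic_file.tar.bz2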