Oracle® Big Data Appliance Software User's Guide
Release 2 (2.0.1)

Part Number E36963-02

2 Administering Oracle Big Data Appliance

This chapter provides information about the software and services installed on Oracle Big Data Appliance. It contains these sections:

  • Monitoring a Cluster Using Oracle Enterprise Manager
  • Managing CDH Operations Using Cloudera Manager
  • Using Hadoop Monitoring Utilities
  • Using Hue to Interact With Hadoop
  • Providing Remote Client Access to CDH
  • Managing User Accounts
  • Recovering Deleted Files
  • About the Oracle Big Data Appliance Software
  • About the Software Services
  • Effects of Hardware on Software Availability
  • Collecting Diagnostic Information for Oracle Customer Support
  • Security on Oracle Big Data Appliance

2.1 Monitoring a Cluster Using Oracle Enterprise Manager

An Oracle Enterprise Manager plug-in enables you to use the same system monitoring tool for Oracle Big Data Appliance as you use for Oracle Exadata Database Machine or any other Oracle Database installation. With the plug-in, you can view the status of the installed software components in tabular or graphic presentations, and start and stop these software services. You can also monitor the health of the network and the rack components.

After selecting a target cluster, you can drill down into these primary areas:

Figure 2-1 shows some of the information provided about the InfiniBand switches.

Figure 2-1 InfiniBand Home in Oracle Enterprise Manager


To monitor Oracle Big Data Appliance using Oracle Enterprise Manager: 

  1. Download and install the plug-in. See Oracle Enterprise Manager System Monitoring Plug-in Installation Guide for Oracle Big Data Appliance.

  2. Log in to Oracle Enterprise Manager as a privileged user.

  3. From the Targets menu, choose Big Data Appliance to view the Big Data page. You can see the overall status of the targets already discovered by Oracle Enterprise Manager.

  4. Select a target cluster to view its detail pages.

  5. Expand the target navigation tree to display the components. Information is available at all levels.

  6. Select a component in the tree to display its home page.

  7. To change the display, choose an item from the drop-down menu at the top left of the main display area.

2.2 Managing CDH Operations Using Cloudera Manager

Cloudera Manager is installed on Oracle Big Data Appliance to help you with Cloudera's Distribution including Apache Hadoop (CDH) operations. Cloudera Manager provides a single administrative interface to all Oracle Big Data Appliance servers configured as part of the Hadoop cluster.

Cloudera Manager simplifies the performance of these administrative tasks:

Cloudera Manager runs on the JobTracker node (node03) of the primary rack and is available on port 7180.

To use Cloudera Manager: 

  1. Open a browser and enter a URL like the following:

    http://bda1node03.example.com:7180
    

    In this example, bda1 is the name of the appliance, node03 is the name of the server, example.com is the domain, and 7180 is the default port number for Cloudera Manager.

  2. Log in with a user name and password for Cloudera Manager. Only a user with administrative privileges can change the settings. Other Cloudera Manager users can view the status of Oracle Big Data Appliance.

See Also:

Cloudera Manager User Guide at

http://ccp.cloudera.com/display/ENT/Cloudera+Manager+User+Guide

or click Help on the Cloudera Manager Help menu

2.2.1 Monitoring the Status of Oracle Big Data Appliance

In Cloudera Manager, you can choose any of the following pages from the menu bar across the top of the display:

  • Services: Monitors the status and health of services running on Oracle Big Data Appliance. Click the name of a service to drill down to additional information.

  • Hosts: Monitors the health, disk usage, load, physical memory, swap space, and other statistics for all servers.

  • Activities: Monitors all MapReduce jobs running in the selected time period.

  • Logs: Collects historical information about the systems and services. You can search for a particular phrase for a selected server, service, and time period. You can also select the minimum severity level of the logged messages included in the search: TRACE, DEBUG, INFO, WARN, ERROR, or FATAL.

  • Events: Records a change in state and other noteworthy occurrences. You can search for one or more keywords for a selected server, service, and time period. You can also select the event type: Audit Event, Activity Event, Health Check, or Log Message.

  • Reports: Generates reports on demand for disk and MapReduce use.

Figure 2-2 shows the opening display of Cloudera Manager, which is the Services page.

Figure 2-2 Cloudera Manager Services Page


2.2.2 Performing Administrative Tasks

As a Cloudera Manager administrator, you can change various properties for monitoring the health and use of Oracle Big Data Appliance, add users, and set up Kerberos security.

To access Cloudera Manager Administration: 

  1. Log in to Cloudera Manager with administrative privileges.

  2. Click Welcome admin at the top right of the page.

2.2.3 Managing Services With Cloudera Manager

Cloudera Manager provides the interface for managing these services:

  • HDFS

  • Hue

  • MapReduce

  • Oozie

  • ZooKeeper

You can use Cloudera Manager to change the configuration of these services, and to stop and restart them.

Note:

Manual edits to Linux service scripts or Hadoop configuration files do not affect these services. You must manage and configure them using Cloudera Manager.

2.3 Using Hadoop Monitoring Utilities

Users can monitor MapReduce jobs without providing a Cloudera Manager user name and password.

2.3.1 Monitoring the JobTracker

Hadoop Map/Reduce Administration monitors the JobTracker, which runs on port 50030 of the JobTracker node (node03) on Oracle Big Data Appliance.

To monitor the JobTracker: 

  • Open a browser and enter a URL like the following:

    http://bda1node03.example.com:50030
    

    In this example, bda1 is the name of the appliance, node03 is the name of the server, and 50030 is the default port number for Hadoop Map/Reduce Administration.
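
If you prefer a quick check from a shell rather than a browser, you can test that the interface is responding with curl. This is only a convenience check and assumes the curl utility is available on the client; jobtracker.jsp is the main page of the Hadoop Map/Reduce Administration interface:

    # Print the HTTP status code returned by the JobTracker web interface (200 indicates success)
    $ curl -s -o /dev/null -w "%{http_code}\n" http://bda1node03.example.com:50030/jobtracker.jsp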

Figure 2-3 shows part of a Hadoop Map/Reduce Administration display.

Figure 2-3 Hadoop Map/Reduce Administration


2.3.2 Monitoring the TaskTracker

The Task Tracker Status interface monitors the TaskTracker on a single node. It is available on port 50060 of all noncritical nodes (node04 to node18) in Oracle Big Data Appliance.

To monitor a TaskTracker: 

  • Open a browser and enter the URL for a particular node like the following:

    http://bda1node13.example.com:50060
    

    In this example, bda1 is the name of the rack, node13 is the name of the server, and 50060 is the default port number for the Task Tracker Status interface.

Figure 2-4 shows the Task Tracker Status interface.

Figure 2-4 Task Tracker Status Interface


2.4 Using Hue to Interact With Hadoop

Hue runs in a browser and provides an easy-to-use interface to several applications to support interaction with Hadoop and HDFS. You can use Hue to perform any of the following tasks:

Hue runs on port 8888 of the JobTracker node (node03).

To use Hue: 

  1. Open Hue in a browser using an address like the one in this example:

    http://bda1node03.example.com:8888

    In this example, bda1 is the cluster name, node03 is the server name, and example.com is the domain.

  2. Log in with your Hue credentials.

    Oracle Big Data Appliance is not configured initially with any Hue user accounts. The first user who connects to Hue can log in with any user name and password, and automatically becomes an administrator. This user can create other user and administrator accounts.

  3. Use the icons across the top to open a utility.

Figure 2-5 shows the Beeswax Query Editor for entering Hive queries.

Figure 2-5 Beeswax Query Editor


See Also:

Hue Installation Guide for information about using Hue, which is already installed and configured on Oracle Big Data Appliance, at

http://cloudera.github.com/hue/docs-2.1.0/manual.html

2.5 Providing Remote Client Access to CDH

Oracle Big Data Appliance supports full local access to all commands and utilities in Cloudera's Distribution including Apache Hadoop (CDH).

You can use a browser on any computer that has access to the client network of Oracle Big Data Appliance to access Cloudera Manager, Hadoop Map/Reduce Administration, the Hadoop Task Tracker interface, and other browser-based Hadoop tools.

To issue Hadoop commands remotely, however, you must connect from a system configured as a CDH client with access to the Oracle Big Data Appliance client network. This section explains how to set up a computer so that you can access HDFS and submit MapReduce jobs on Oracle Big Data Appliance.

To follow these procedures, you must have these access privileges:

If you do not have these access privileges, then contact your system administrator for help.

See Also:

My Oracle Support ID 1506203.1

2.5.1 Installing CDH on the Client System

The system that you use to access Oracle Big Data Appliance must run an operating system that Cloudera supports for CDH4. For the list of supported operating systems, see "Before You Install CDH4 on a Cluster" in the Cloudera CDH4 Installation Guide at

http://ccp.cloudera.com/display/CDH4DOC/Before+You+Install+CDH4+on+a+Cluster

To install the CDH client software: 

  1. Follow the installation instructions for your operating system provided in the Cloudera CDH4 Installation Guide at

    http://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide

    When you are done installing the Hadoop core and native packages, the system can act as a basic CDH client.

    Note:

    Be sure to install CDH4 Update 1 (CDH4u1) or a later version.
  2. To provide support for other components, such as Hive, Pig, or Oozie, see the component installation instructions.
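
For example, on an Oracle Linux or Red Hat Enterprise Linux client that already has the Cloudera CDH4 package repository configured, the installation might look like the following sketch. The package names and the use of yum are assumptions that depend on your platform; follow the Cloudera guide for the authoritative steps.

    # Install the Hadoop core client packages (package name assumes the CDH4 repository)
    $ sudo yum install hadoop-client

    # Optionally install client packages for other components, such as Hive and Pig
    $ sudo yum install hive pig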

2.5.2 Configuring CDH

After installing CDH, you must configure it for use with Oracle Big Data Appliance.

To configure the Hadoop client: 

  1. Open a browser on your client system and connect to Cloudera Manager. It runs on the JobTracker node (node03) and listens on port 7180, as shown in this example:

    http://bda1node03.example.com:7180
    
  2. Log in as admin.

  3. Cloudera Manager opens on the Services tab. Click the Client Configuration URLs button.

  4. In the popup window, click the URL for mapreduce1 (/cmf/services/2/client-config) to download mapreduce1-clientconfig.zip.

    The following figure shows the download page for the client configuration.


  5. Unzip mapreduce1-clientconfig.zip into a permanent location on the client system.

    $ unzip mapreduce1-clientconfig.zip
    Archive:  mapreduce1-clientconfig.zip
      inflating: hadoop-conf/hadoop-env.sh
      inflating: hadoop-conf/core-site.xml
      inflating: hadoop-conf/hdfs-site.xml
      inflating: hadoop-conf/log4j.properties
      inflating: hadoop-conf/mapred-site.xml
    

    All files are stored in a subdirectory named hadoop-conf.

  6. Open hadoop-env.sh in a text editor and change JAVA_HOME to the correct location on your system:

    export JAVA_HOME=full_directory_path
    
  7. Delete the number sign (#) at the start of the JAVA_HOME line to uncomment it, and then save the file.

  8. Do one of the following:

    • Overwrite the existing configuration files with the downloaded configuration files in Step 5.

      # cd /full_path/hadoop-conf
      # cp * /usr/lib/hadoop/conf
      
    • Set the HADOOP_CONF_DIR environment variable to the new hadoop-conf directory, using the appropriate syntax for your shell. For example, this command is for the Bash shell:

      $ export HADOOP_CONF_DIR="/full_path/hadoop-conf"
      

      Note:

      Be sure to add this setting to the appropriate startup files for your environment, such as .bashrc and .cshrc.
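
      For example, for the Bash shell you might append the setting to your startup file as follows. The path is a placeholder for the actual location of the downloaded configuration:

      # Make the setting permanent for new Bash shells
      $ echo 'export HADOOP_CONF_DIR="/full_path/hadoop-conf"' >> ~/.bashrc
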
  9. Verify that you can access HDFS on Oracle Big Data Appliance from the client, by entering a simple Hadoop file system command like the following:

    $ hadoop fs -ls /user
    Found 4 items
    drwx------   - hdfs     supergroup          0 2013-01-16 13:50 /user/hdfs
    drwxr-xr-x   - hive     supergroup          0 2013-01-16 12:58 /user/hive
    drwxr-xr-x   - oozie    hadoop              0 2013-01-16 13:01 /user/oozie
    drwxr-xr-x   - oracle   hadoop              0 2013-01-29 12:50 /user/oracle
    

    The output lists the HDFS users defined on Oracle Big Data Appliance, not the users defined on the client system.

  10. Validate the installation by submitting a MapReduce job. You must be logged in to the host computer under the same user name as your HDFS user name on Oracle Big Data Appliance.

    The following example calculates the value of pi:

    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.1.2.jar pi 10 1000000
    Number of Maps  = 10
    Samples per Map = 1000000
    Wrote input for Map #0
    Wrote input for Map #1
         .
         .
         .
    Job Finished in 17.981 seconds
    Estimated value of Pi is 3.14158440000000000000
    
  11. Use Cloudera Manager to verify that the job ran on Oracle Big Data Appliance instead of the local system. Select mapreduce from the Activities menu for a list of jobs.

Figure 2-6 shows the job created by the previous example.

Figure 2-6 Monitoring a MapReduce Job in Cloudera Manager


2.6 Managing User Accounts

This section describes the users created for use by the software, and explains how to create additional users. It contains the following topics:

  • About Predefined Users and Groups

  • Creating New HDFS Users

2.6.1 About Predefined Users and Groups

Every open-source package installed on Oracle Big Data Appliance creates one or more users and groups. Most of these users do not have login privileges, shells, or home directories. They are used by daemons and are not intended as an interface for individual users. For example, Hadoop operates as the hdfs user, MapReduce operates as mapred, and Hive operates as hive.

You can use the oracle identity to run Hadoop and Hive jobs immediately after the Oracle Big Data Appliance software is installed. This user account has login privileges, a shell, and a home directory.

Oracle NoSQL Database and Oracle Data Integrator run as the oracle user. Its primary group is oinstall.
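
You can confirm a user's identity and group memberships on any server with the standard Linux id command. For example:

    # id oracle          # displays the uid, primary group, and supplementary groups of the oracle user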

Note:

Do not delete or modify the users created during installation, because they are required for the software to operate.

Table 2-1 identifies the operating system users and groups that are created automatically during installation of Oracle Big Data Appliance software for use by CDH components and other software packages.

Table 2-1 Operating System Users and Groups

User Name    Group            Used By                                               Login Rights
flume        flume            Flume parent and nodes                                No
hbase        hbase            HBase processes                                       No
hdfs         hadoop           NameNode, DataNode                                    No
hive         hive             Hive metastore and server processes                   No
hue          hue              Hue processes                                         No
mapred       hadoop           JobTracker, TaskTracker, Hive Thrift daemon           Yes
mysql        mysql            MySQL server                                          Yes
oozie        oozie            Oozie server                                          No
oracle       dba, oinstall    Oracle NoSQL Database, Oracle Loader for Hadoop,      Yes
                              Oracle Data Integrator, and the Oracle DBA
puppet       puppet           Puppet parent (puppet nodes run as root)              No
sqoop        sqoop            Sqoop metastore                                       No
svctag                        Auto Service Request                                  No
zookeeper    zookeeper        ZooKeeper processes                                   No


2.6.2 Creating New HDFS Users

When creating additional user accounts, define them as follows:

  • To run MapReduce jobs, users must be in the hadoop group.

  • To create and modify tables in Hive, users must be in the hive group.

  • To create Hue users, open Hue in a browser and click the User Admin icon. See "Using Hue to Interact With Hadoop."

To create an HDFS user:  

  1. Open an ssh connection as the root user to a noncritical node (node04 to node18).

  2. Create the user's home directory:

    # sudo -u hdfs hadoop fs -mkdir /user/user_name
    

    You use sudo because the HDFS super user is hdfs (not root).

  3. Change the ownership of the directory:

    # sudo -u hdfs hadoop fs -chown user_name:primary_group_name /user/user_name
    
  4. Verify that the directory is set up correctly:

    # hadoop fs -ls /user
    
  5. Create the operating system user across all nodes in the cluster:

    # dcli useradd -G group_name [,group_name...] -m user_name
    

    In this syntax, replace group_name with an existing group and user_name with the new name.

Users do not need login privileges on Oracle Big Data Appliance to run MapReduce jobs from a remote client. However, users who want to log in to Oracle Big Data Appliance must have a password. You can set or reset a password in the same way.

To set a user password across all Oracle Big Data Appliance servers:  

  1. Create an HDFS user as described in the previous procedure.

  2. Confirm that the user does not have a password:

    # dcli passwd -S user_name
    bda1node01.example.com: jdoe NP 2013-01-22 0 99999 7 -1 (Empty password.)
    bda1node02.example.com: jdoe NP 2013-01-22 0 99999 7 -1 (Empty password.)
    bda1node03.example.com: jdoe NP 2013-01-22 0 99999 7 -1 (Empty password.)
    

    The output shows that no password is set for user jdoe.

  3. Set the password:

    hash=$(echo 'password' | openssl passwd -1 -stdin); dcli "usermod --pass='$hash' user_name"
    
  4. Confirm that the password is set across all servers:

    # dcli passwd -S user_name
    bda1node01.example.com: jdoe PS 2013-01-24 0 99999 7 -1 (Password set, MD5 crypt.)
    bda1node02.example.com: jdoe PS 2013-01-24 0 99999 7 -1 (Password set, MD5 crypt.)
    bda1node03.example.com: jdoe PS 2013-01-24 0 99999 7 -1 (Password set, MD5 crypt.)
    

Example 2-1 creates a user named jdoe with a primary group of hadoop and an additional group of hive.

Example 2-1 Creating a Hadoop User

# sudo -u hdfs hadoop fs -mkdir /user/jdoe
# sudo -u hdfs hadoop fs -chown jdoe:hadoop /user/jdoe
# hadoop fs -ls /user
Found 5 items
drwx------   - hdfs     supergroup          0 2013-01-16 13:50 /user/hdfs
drwxr-xr-x   - hive     supergroup          0 2013-01-16 12:58 /user/hive
drwxr-xr-x   - jdoe     hadoop              0 2013-01-18 14:04 /user/jdoe
drwxr-xr-x   - oozie    hadoop              0 2013-01-16 13:01 /user/oozie
drwxr-xr-x   - oracle   hadoop              0 2013-01-16 13:01 /user/oracle
# dcli useradd -G hive -m jdoe
# dcli ls /home
bda1node01.example.com: hive
bda1node01.example.com: jdoe
bda1node01.example.com: oracle
bda1node02.example.com: hive
bda1node02.example.com: jdoe
     .
     .
     .


2.7 Recovering Deleted Files

CDH provides an optional trash facility, so that a deleted file or directory is moved to a trash directory for a set period of time instead of being deleted immediately from the system. By default, the trash facility is enabled for HDFS and all HDFS clients.

2.7.1 Restoring Files from the Trash

When the trash facility is enabled, you can easily restore files that were previously deleted.

To restore a file from the trash directory: 

  1. Check that the deleted file is in the trash. The following example checks for files deleted by the oracle user:

    $ hadoop fs -ls .Trash/Current/user/oracle
    Found 1 items
    -rw-r--r--  3 oracle hadoop  242510990 2012-08-31 11:20 /user/oracle/.Trash/Current/user/oracle/ontime_s.dat
    
  2. Move or copy the file to its previous location. The following example moves ontime_s.dat from the trash to the HDFS /user/oracle directory.

    $ hadoop fs -mv .Trash/Current/user/oracle/ontime_s.dat /user/oracle/ontime_s.dat
    

2.7.2 Changing the Trash Interval

The trash interval is the minimum number of minutes that a file remains in the trash directory before being deleted permanently from the system. The default value is 1 day (24 hours).

To change the trash interval: 

  1. Open Cloudera Manager. See "Managing CDH Operations Using Cloudera Manager".

  2. On the All Services page under Name, click hdfs1.

  3. On the hdfs1 page, click the Configuration subtab.

  4. Search for or scroll down to the Filesystem Trash Interval property under NameNode Settings. See Figure 2-7.

  5. Click the current value, and enter a new value in the pop-up form.

  6. Click Save Changes.

  7. Expand the Actions menu at the top of the page and choose Restart.
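
To confirm the value that a Hadoop client actually sees, you can query the effective configuration from the command line. The underlying Hadoop property is fs.trash.interval, expressed in minutes; this is only a quick check, not part of the Cloudera Manager procedure:

    # Print the effective trash interval, in minutes, as seen by the Hadoop client configuration
    $ hdfs getconf -confKey fs.trash.interval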

Figure 2-7 shows the Filesystem Trash Interval property in Cloudera Manager.

Figure 2-7 HDFS Property Settings in Cloudera Manager


2.7.3 Disabling the Trash Facility

The trash facility on Oracle Big Data Appliance is enabled by default. You can change this configuration at the server or the client level. When the trash facility is disabled, deleted files and directories are not moved to the trash. They are not recoverable.

2.7.3.1 Completely Disabling the Trash Facility

The following procedure disables the trash facility for HDFS. When the trash facility is completely disabled, the client configuration is irrelevant.

To completely disable the trash facility: 

  1. Open Cloudera Manager. See "Managing CDH Operations Using Cloudera Manager".

  2. On the All Services page under Name, click hdfs1.

  3. On the hdfs1 page, click the Configuration subtab.

  4. Search for or scroll down to the Filesystem Trash Interval property under NameNode Settings. See Figure 2-7.

  5. Click the current value, and enter a value of 0 (zero) in the pop-up form.

  6. Click Save Changes.

  7. Expand the Actions menu at the top of the page and choose Restart.

2.7.3.2 Disabling the Trash Facility for Local HDFS Clients

All HDFS clients that are installed on Oracle Big Data Appliance are configured to use the trash facility. An HDFS client is any software that connects to HDFS to perform operations such as listing HDFS files, copying files to and from HDFS, and creating directories.

You can use Cloudera Manager to change the local client configuration so that these clients do not use the trash, even though the trash facility itself remains enabled on the cluster.

Note:

If you do not want any clients to use the trash, then you can completely disable the trash facility. See "Completely Disabling the Trash Facility."

To disable the trash facility for local HDFS clients:  

  1. Open Cloudera Manager. See "Managing CDH Operations Using Cloudera Manager".

  2. On the All Services page under Name, click hdfs1.

  3. On the hdfs1 page, click the Configuration subtab.

  4. Search for or scroll down to the Use Trash property under Client Settings. See Figure 2-7.

  5. Deselect the Use Trash check box.

  6. Click Save Changes. This setting is used to configure all new HDFS clients downloaded to Oracle Big Data Appliance.

  7. To deploy the new configuration to existing clients:

    1. Expand the Actions menu and choose Deploy Client Configuration.

    2. At the prompt to confirm the action, click Deploy Client Configuration.

2.7.3.3 Disabling the Trash Facility for a Remote HDFS Client

Remote HDFS clients are typically configured by downloading and installing a CDH client, as described in "Providing Remote Client Access to CDH." Oracle SQL Connector for HDFS and Oracle R Connector for Hadoop are examples of remote clients.

To disable the trash facility for a remote HDFS client: 

  1. Open a connection to the system where the CDH client is installed.

  2. Open /etc/hadoop/conf/hdfs-site.xml in a text editor.

  3. Change the trash interval to zero:

    <property>
         <name>fs.trash.interval</name>
         <value>0</value>
    </property>
    
  4. Save the file.

2.8 About the Oracle Big Data Appliance Software

The following sections identify the software installed on Oracle Big Data Appliance and where it runs in the rack. Some components operate with Oracle Database 11.2.0.2 and later releases.

2.8.1 Software Components

These software components are installed on all 18 servers in an Oracle Big Data Appliance rack. Oracle Linux, required drivers, firmware, and hardware verification utilities are factory installed. All other software is installed on site using the Mammoth Utility. The optional software components may not be configured in your installation.

Note:

You do not need to install additional software on Oracle Big Data Appliance. Doing so may result in a loss of warranty and support. See the Oracle Big Data Appliance Owner's Guide.

Base image software: 

Mammoth installation: 

  • Cloudera's Distribution including Apache Hadoop Release 4 Update 1.2 (CDH)

  • Cloudera Manager Enterprise 4.1.2

  • Oracle Database Instant Client 11.2.0.3

  • Oracle NoSQL Database Community Edition 11g Release 2.0 (optional)

  • Oracle Big Data Connectors 2.0 (optional):

    • Oracle SQL Connector for Hadoop Distributed File System (HDFS)

    • Oracle Loader for Hadoop

    • Oracle Data Integrator Agent 11.1.1.6.0

    • Oracle R Connector for Hadoop

See Also:

Oracle Big Data Appliance Owner's Guide for information about the Mammoth Utility

Figure 2-8 shows the relationships among the major components.

Figure 2-8 Major Software Components of Oracle Big Data Appliance


2.8.2 Logical Disk Layout

Each server has 12 disks. The critical operating system is stored on disks 1 and 2.

Table 2-2 describes how the disks are partitioned.

Table 2-2 Logical Disk Layout

Disks 1 to 2:
    150 gigabytes (GB) physical and logical partition, mirrored to create two copies, with the Linux operating system, all installed software, NameNode data, and MySQL Database data. The NameNode and MySQL Database data are replicated on two servers for a total of four copies.
    2.8 terabytes (TB) HDFS data partition

Disks 3 to 10:
    Single HDFS data partition

Disks 11 to 12:
    Single Oracle NoSQL Database partition, if activated during software installation; otherwise, a single HDFS data partition
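
To see how these partitions appear on a particular server, you can use standard Linux utilities. The commands below are only an illustration; device names and mount points vary:

    # df -h                    # show mounted file systems, their sizes, and mount points
    # cat /proc/partitions     # list the disk partitions known to the kernel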


2.9 About the Software Services

This section contains the following topics:

  • Monitoring the CDH Services

  • Where Do the Services Run?

  • Automatic Failover of the NameNode

  • Unconfigured Software

2.9.1 Monitoring the CDH Services

You can use Cloudera Manager to monitor the CDH services on Oracle Big Data Appliance.

To monitor the services: 

  1. In Cloudera Manager, click the Services tab at the top of the page to display the Services page.

  2. Click the name of a service to see its detail pages. The service opens on the Status page.

  3. Click the link to the page that you want to view: Status, Instances, Commands, Configuration, or Audits.

2.9.2 Where Do the Services Run?

All services are installed on all servers, but individual services run only on designated nodes in the Hadoop cluster.

Table 2-3 identifies the nodes where the services run on the primary rack. Services that run on all nodes run on all racks of a multirack installation.

Table 2-3 Software Service Locations

Service Type                   Role                                          Node Name                                           Initial Node Position
Cloudera Management Services   Cloudera Manager agents                       All nodes                                           Node01 to node18
Cloudera Management Services   Cloudera Manager server                       JobTracker node                                     Node03
HDFS                           Balancer                                      First NameNode                                      Node01
HDFS                           DataNode                                      All nodes                                           Node01 to node18
HDFS                           Failover controller                           First NameNode and second NameNode                  Node01 and node02
HDFS                           First NameNode                                First NameNode                                      Node01
HDFS                           JournalNode                                   First NameNode, second NameNode, JobTracker node    Node01 to node03
HDFS                           Second NameNode                               Second NameNode                                     Node02
Hive                           Hive server                                   JobTracker node                                     Node03
Hue                            Beeswax server                                JobTracker node                                     Node03
Hue                            Hue server                                    JobTracker node                                     Node03
MapReduce                      JobTracker                                    JobTracker node                                     Node03
MapReduce                      TaskTracker                                   All noncritical nodes                               Node04 to node18
MySQL                          MySQL Backup Server (1)                       Second NameNode                                     Node02
MySQL                          MySQL Primary Server (1)                      JobTracker node                                     Node03
NoSQL                          Oracle NoSQL Database Administration (2)      Second NameNode                                     Node02
NoSQL                          Oracle NoSQL Database Server processes (2)    All nodes                                           Node01 to node18
ODI                            Oracle Data Integrator agent (2)              JobTracker node                                     Node03
Puppet                         Puppet agents                                 All nodes                                           Node01 to node18
Puppet                         Puppet master                                 First NameNode                                      Node01
ZooKeeper                      ZooKeeper server                              First NameNode, second NameNode, JobTracker node    Node01 to node03

Footnote 1: If the software was upgraded from version 1.0, then MySQL Backup remains on node02 and MySQL Primary Server remains on node03.

Footnote 2: Started only if requested in the Oracle Big Data Appliance Configuration Worksheets.

2.9.3 Automatic Failover of the NameNode

The NameNode is the most critical process because it keeps track of the location of all data. Without a healthy NameNode, the entire cluster fails. Apache Hadoop v0.20.2 and earlier are vulnerable to failure because they have a single NameNode.

Cloudera's Distribution including Apache Hadoop Version 4 (CDH4) reduces this vulnerability by maintaining redundant NameNodes. The data is replicated during normal operation as follows:

  • CDH maintains redundant NameNodes on the first two nodes. One of the NameNodes is in active mode, and the other NameNode is in hot standby mode. If the active NameNode fails, then the role of active NameNode automatically fails over to the standby NameNode.

  • The NameNode data is written to a mirrored partition so that the loss of a single disk can be tolerated. This mirroring is done at the factory as part of the operating system installation.

  • The active NameNode records all changes in at least two JournalNode processes, which the standby NameNode reads. There are three JournalNodes, which run on node01 to node03.
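
As an illustration, on a CDH4 cluster with NameNode high availability you can query which NameNode is currently active by using the hdfs haadmin utility. The NameNode IDs shown (nn1 and nn2) are placeholders; the actual IDs depend on the cluster configuration:

    # Report the high-availability state (active or standby) of each configured NameNode
    $ hdfs haadmin -getServiceState nn1
    $ hdfs haadmin -getServiceState nn2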

Note:

Oracle Big Data Appliance 2.0 does not support the use of an external NFS filer for backups and does not use NameNode federation.

Figure 2-9 shows the relationships among the processes that support automatic failover on Oracle Big Data Appliance.

Figure 2-9 Automatic Failover of the NameNode on Oracle Big Data Appliance


2.9.4 Unconfigured Software

The following tools are installed but not configured. You must configure them for your environment before use.

  • Flume

  • HBase

  • Mahout

  • Sqoop

  • Whirr

See Also:

CDH4 Installation and Configuration Guide for configuration procedures at

http://oracle.cloudera.com

2.10 Effects of Hardware on Software Availability

The effects of a server failure vary depending on the server's function within the CDH cluster. Oracle Big Data Appliance servers are more robust than commodity hardware, so you should experience fewer hardware failures. This section highlights the most important services that run on the various servers of the primary rack. For a full list, see Table 2-3.

2.10.1 Critical and Noncritical Nodes

Critical nodes are required for the cluster to operate normally and provide all services to users. In contrast, the cluster continues to operate with no loss of service when a noncritical node fails.

The critical services are installed initially on the first three nodes of the primary rack. Table 2-4 identifies the critical services that run on these nodes. The remaining nodes (initially node04 to node18) only run noncritical services. If a hardware failure occurs on one of the critical nodes, then the services can be moved to another, noncritical server. For example, if node02 fails, its critical services might be moved to node05. Table 2-4 provides names to identify the nodes providing critical services.

Moving a critical node requires that all clients be reconfigured with the address of the new node. The alternative is to wait for the failed server to be repaired. You must weigh the loss of services against the inconvenience of reconfiguring the clients.

Table 2-4 Critical Nodes

Node Name         Initial Node Position   Critical Functions
First NameNode    Node01                  ZooKeeper, first NameNode, failover controller, balancer, puppet master
Second NameNode   Node02                  ZooKeeper, second NameNode, failover controller, MySQL backup server
JobTracker Node   Node03                  ZooKeeper, JobTracker, Cloudera Manager server, Oracle Data Integrator agent, MySQL primary server, Hue, Hive


2.10.2 First NameNode

One instance of the NameNode initially runs on node01. If this node fails or goes offline (such as a reboot), then the second NameNode (node02) automatically takes over to maintain the normal activities of the cluster.

If the second NameNode is already active, then it continues without a backup. With only one NameNode, the cluster is vulnerable to failure because it has lost the redundancy needed for automatic failover of the active NameNode.

These functions are also disrupted:

  • Balancer: The balancer runs periodically to ensure that data is distributed evenly across the cluster. Balancing is not performed when the first NameNode is down.

  • Puppet master: The Mammoth utilities use Puppet, and so you cannot install or reinstall the software if, for example, you must replace a disk drive elsewhere in the rack.

2.10.3 Second NameNode

One instance of the NameNode initially runs on node02. If this node fails, then the function of the NameNode either fails over to the first NameNode (node01) or continues there without a backup. However, the cluster has lost the redundancy needed for automatic failover if the first NameNode also fails.

These services are also disrupted:

  • MySQL Master Database: Cloudera Manager, Oracle Data Integrator, Hive, and Oozie use MySQL Database. The data is replicated automatically, but you cannot access it when the master database server is down.

  • Oracle NoSQL Database KV Administration: Oracle NoSQL Database is an optional component of Oracle Big Data Appliance, so the extent of a disruption due to a node failure depends on whether you are using it and how critical it is to your applications.

2.10.4 JobTracker Node

The JobTracker assigns MapReduce tasks to specific nodes in the CDH cluster. Without the JobTracker node (node03), this critical function is not performed.

These services are also disrupted:

  • Cloudera Manager: This tool provides central management for the entire CDH cluster. Without this tool, you can still monitor activities using the utilities described in "Using Hadoop Monitoring Utilities".

  • Oracle Data Integrator: This service supports Oracle Data Integrator Application Adapter for Hadoop. You cannot use this connector when the JobTracker node is down.

  • Hive: Hive provides a SQL-like interface to data that is stored in HDFS. Most of the Oracle Big Data Connectors can access Hive tables, which are not available if this node fails.

  • MySQL Backup Database: MySQL Server continues to run, although there is no backup of the master database.

2.10.5 Noncritical Nodes

The noncritical nodes (node04 to node18) are optional in that Oracle Big Data Appliance continues to operate with no loss of service if a failure occurs. The NameNode automatically replicates the lost data to maintain three copies at all times. MapReduce jobs execute on copies of the data stored elsewhere in the cluster. The only loss is in computational power, because there are fewer servers on which to distribute the work.
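
To verify that HDFS has finished re-replicating data after a noncritical node failure, you can run the HDFS file system check as the hdfs user. A minimal sketch:

    # Report missing, corrupt, and under-replicated blocks for the entire file system
    $ sudo -u hdfs hdfs fsck /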

2.11 Collecting Diagnostic Information for Oracle Customer Support

If you need help from Oracle Support to troubleshoot CDH issues, then you should first collect diagnostic information using the bdadiag utility with the cm option.

To collect diagnostic information: 

  1. Log in to an Oracle Big Data Appliance server as root.

  2. Run bdadiag with at least the cm option. You can include additional options on the command line as appropriate. See the Oracle Big Data Appliance Owner's Guide for a complete description of the bdadiag syntax.

    # bdadiag cm
    

    The command output identifies the name and the location of the diagnostic file.

  3. Go to My Oracle Support at http://support.oracle.com.

  4. Open a Service Request (SR) if you have not already done so.

  5. Upload the bz2 file into the SR. If the file is too large, then upload it to ftp.oracle.com, as described in the next procedure.

To upload the diagnostics to ftp.oracle.com: 

  1. Open an FTP client and connect to ftp.oracle.com.

    See Example 2-2 if you are using a command-line FTP client from Oracle Big Data Appliance.

  2. Log in as user anonymous and leave the password field blank.

  3. In the bda/incoming directory, create a directory using the SR number for the name, in the format SRnumber. The resulting directory structure looks like this:

    bda
       incoming
          SRnumber
    
  4. Set the binary option to prevent corruption of binary data.

  5. Upload the diagnostic bz2 file to the new directory.

  6. Update the SR with the full path, which has the form bda/incoming/SRnumber, and the file name.

Example 2-2 shows the commands to upload the diagnostics using the FTP command interface on Oracle Big Data Appliance.

Example 2-2 Uploading Diagnostics Using FTP

# ftp
ftp> open ftp.oracle.com
Connected to bigip-ftp.oracle.com.
220-***********************************************************************
220-Oracle FTP Server
         .
         .
         .
220-****************************************************************************
220- 
220
530 Please login with USER and PASS.
530 Please login with USER and PASS.
KERBEROS_V4 rejected as an authentication type
Name (ftp.oracle.com:root): anonymous
331 Please specify the password.
Password:
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> cd bda/incoming
250 Directory successfully changed.
ftp> mkdir SR12345
257 "/bda/incoming/SR12345" created
ftp> cd SR12345
250 Directory successfully changed.
ftp> put /tmp/bdadiag_bda1node01_1216FM5497_2013_01_18_07_33.tar.bz2
local: bdadiag_bda1node01_1216FM5497_2013_01_18_07_33.tar.bz2 
remote: bdadiag_bda1node01_1216FM5497_2013_01_18_07_33.tar.bz2
227 Entering Passive Mode (141,146,44,21,212,32)
150 Ok to send data.
226 File receive OK.
2404836 bytes sent in 1.8 seconds (1.3e+03 Kbytes/s)

2.12 Security on Oracle Big Data Appliance

You can take precautions to prevent unauthorized use of the software and data on Oracle Big Data Appliance.

This section contains these topics:

  • Port Numbers Used on Oracle Big Data Appliance

  • Using Kerberos for CDH Security

  • About Puppet Security

2.12.1 Port Numbers Used on Oracle Big Data Appliance

Table 2-5 identifies the port numbers that might be used in addition to those used by CDH. For the full list of CDH port numbers, go to the Cloudera website at

http://ccp.cloudera.com/display/CDH4DOC/Configuring+Ports+for+CDH4

To view the ports used on a particular server: 

  1. In Cloudera Manager, click the Hosts tab at the top of the page to display the Hosts page.

  2. In the Name column, click a server link to see its detail page.

  3. Scroll down to the Ports section.
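
You can also check from a shell whether a particular port is listening on a server. The following sketch uses standard Linux tools and is run as root on the server; Cloudera Manager's port 7180 is used as an example:

    # netstat -tlnp | grep 7180     # list listening TCP sockets, filtered to port 7180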


Table 2-5 Oracle Big Data Appliance Port Numbers

Service                                Port
Automated Service Monitor (ASM)        30920
MySQL Database                         3306
Oracle Data Integrator Agent           20910
Oracle NoSQL Database administration   5001
Oracle NoSQL Database processes        5010 to 5020
Oracle NoSQL Database registration     5000
Port map                               111
Puppet master service                  8140
Puppet node service                    8139
rpc.statd                              668
ssh                                    22
xinetd (service tag)                   6481


2.12.2 Using Kerberos for CDH Security

Apache Hadoop is not an inherently secure system. It is protected only by network security. After a connection is established, a client has full access to the system.

Cloudera's Distribution including Apache Hadoop (CDH) supports the Kerberos network authentication protocol to prevent malicious impersonation. You must install and configure Kerberos, and set up a Kerberos Key Distribution Center and realm. Then you configure various components of CDH to use Kerberos.

CDH provides these securities when configured to use Kerberos:

  • The CDH master nodes (the NameNode and the JobTracker) resolve users' group names so that users cannot manipulate their group memberships.

  • Map tasks run under the identity of the user who submitted the job.

  • Authorization mechanisms in HDFS and MapReduce help control user access to data.
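
As an illustration, after Kerberos is configured, a user obtains a ticket before issuing Hadoop commands. The principal and realm shown are placeholders:

    # Obtain a Kerberos ticket-granting ticket, then run Hadoop commands as usual
    $ kinit jdoe@EXAMPLE.COM
    $ hadoop fs -ls /user/jdoe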

See Also:

http://oracle.cloudera.com for these manuals:
  • CDH4 Security Guide

  • Configuring Hadoop Security with Cloudera Manager

  • Configuring TLS Security for Cloudera Manager

2.12.3 About Puppet Security

The puppet node service (puppetd) runs continuously as root on all servers. It listens on port 8139 for "kick" requests, which trigger it to request updates from the puppet master. It does not receive updates on this port.

The puppet master service (puppetmasterd) runs continuously as the puppet user on the first server of the primary Oracle Big Data Appliance rack. It listens on port 8140 for requests to push updates to puppet nodes.

The puppet nodes generate and send certificates to the puppet master to register initially during installation of the software. For updates to the software, the puppet master signals ("kicks") the puppet nodes, which then request all configuration changes from the puppet master node that they are registered with.

The puppet master sends updates only to puppet nodes that have known, valid certificates. Puppet nodes only accept updates from the puppet master host name they initially registered with. Because Oracle Big Data Appliance uses an internal network for communication within the rack, the puppet master host name resolves using /etc/hosts to an internal, private IP address.