Oracle® Big Data Appliance Software User's Guide
Release 1 (1.0)

Part Number E25961-04

2 Administering Oracle Big Data Appliance

This chapter provides information about the software and services installed on Oracle Big Data Appliance.

2.1 Managing CDH Operations

Cloudera Manager is installed on Oracle Big Data Appliance to help you with Cloudera's Distribution including Apache Hadoop (CDH) operations. Cloudera Manager provides a single administrative interface to all Oracle Big Data Appliance servers configured as part of the Hadoop cluster.

Cloudera Manager simplifies administrative tasks such as monitoring services and hosts, tracking MapReduce activity, searching logs and events, and generating usage reports.

Cloudera Manager runs on node02 and is available on port 7180.

To use Cloudera Manager: 

  1. Open a browser and enter a URL like the following:

    http://bda1node02.example.com:7180
    

    In this example, bda1 is the name of the appliance, node02 is the name of the server, example.com is the domain, and 7180 is the default port number for Cloudera Manager.

  2. Log in with a user name and password for Cloudera Manager. Only a user with administrative privileges can change the settings. Other Cloudera Manager users can view the status of Oracle Big Data Appliance.
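The URL in step 1 follows a fixed pattern, so it can be assembled from the site-specific names. The following is a minimal sketch, not an Oracle-supplied utility: the function name bda_url is hypothetical, and bda1 and example.com are the placeholder values from the example above.

```shell
#!/bin/sh
# bda_url is a hypothetical helper, not part of Oracle Big Data Appliance.
# Arguments: appliance name, two-digit node number, domain, port.
bda_url() {
    printf 'http://%snode%s.%s:%s\n' "$1" "$2" "$3" "$4"
}

# Placeholder values from the example; substitute the names used at your site.
CM_URL=$(bda_url bda1 02 example.com 7180)
echo "$CM_URL"
```

If curl is installed on the client, a command such as curl --silent --head --max-time 5 "$CM_URL" gives a quick indication of whether the port is reachable before you open a browser.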

See Also:

Cloudera Manager User Guide at http://oracle.cloudera.com/ or click Help on the Cloudera Manager Help menu.

2.1.1 Monitoring the Status of Oracle Big Data Appliance

In Cloudera Manager, you can choose any of these pages from the navigation bar across the top of the display:

  • Services: Monitors the status and health of services running on Oracle Big Data Appliance. Click the name of a service to drill down to additional information.

  • Hosts: Monitors the health, disk usage, load, physical memory, swap space, and so forth of all servers.

  • Activities: Monitors all MapReduce jobs running in the selected time period.

  • Logs: Collects historical information about the systems and services. You can search for a particular phrase for a selected server, service, and time period. You can also select the minimum severity level of the logged messages included in the search: TRACE, DEBUG, INFO, WARN, ERROR, or FATAL.

  • Events: Records a change in state and other noteworthy occurrences. You can search for one or more keywords for a selected server, service, and time period. You can also select the event types: Audit Event, Activity Event, Health Check, or Log Message.

  • Reports: Generates reports on demand for disk and MapReduce use.

Figure 2-1 shows the opening display of Cloudera Manager, which is the Services page.

Figure 2-1 Cloudera Manager Services Page


2.1.2 Performing Administrative Tasks

As a Cloudera Manager administrator, you can change various properties for monitoring the health and use of Oracle Big Data Appliance, add users, and set up Kerberos security.

To access Cloudera Manager Administration: 

  1. Log in to Cloudera Manager with administrative privileges.

  2. Click Welcome admin at the top right of the page.

2.1.3 Collecting Diagnostic Information

If you need help from Oracle Support to troubleshoot CDH issues, then you should first collect diagnostic information using Cloudera Manager.

To collect diagnostic information about CDH: 

  1. Log in to Cloudera Manager with administrative privileges.

  2. From the Help menu, click Send Diagnostic Data.

  3. Verify that Send Diagnostic Data to Cloudera Automatically is not selected. Keep the other default settings.

  4. Click Collect Host Statistics Globally.

  5. Wait while all statistics are collected on all nodes.

  6. Click Download Result Data and save the ZIP file with the default name, which identifies your CDH license.

  7. Go to My Oracle Support at http://support.oracle.com.

  8. Open a Service Request (SR) if you have not already done so.

  9. Upload the ZIP file into the SR. If the file is too large, then upload it to ftp.oracle.com, as described in the next procedure.

To upload the diagnostics to ftp.oracle.com: 

  1. Open an FTP client and connect to ftp.oracle.com.

    You can use an FTP client such as WinSCP4 to upload the ZIP file. See Example 2-1 if you are using a command-line FTP client.

  2. Log in as user anonymous and leave the password blank.

  3. In the bda/incoming directory, create a directory using the SR number for the name, in the format SRnumber. The resulting directory structure looks like this:

    bda
       incoming
          SRnumber
    
  4. Set the binary option to prevent corruption of binary data.

  5. Upload the diagnostics ZIP file to the SRnumber directory that you created in step 3.

  6. Update the SR with the full path and file name.

Example 2-1 shows the commands to upload the diagnostics using the Windows FTP command interface.

Example 2-1 Uploading Diagnostics Using Windows FTP

ftp> open ftp.oracle.com
Connected to bigip-ftp.oracle.com.
220-***********************************************************************
220-Oracle FTP Server
         .
         .
         .
220-****************************************************************************
 
220
User (bigip-ftp.oracle.com:(none)): anonymous
331 Please specify the password.
Password:
230 Login successful.
ftp> cd bda/incoming
250 Directory successfully changed.
ftp> mkdir SR12345
257 "/bda/incoming/SR12345" created
ftp> cd SR12345
250 Directory successfully changed.
ftp> bin
200 Switching to Binary mode.
ftp> put D:\Downloads\3609df...c1.default.20122505-15-27.host-statistics.zip
200 PORT command successful. Consider using PASV.
150 Ok to send data.
226 File receive OK.
ftp: 706755 bytes sent in 1.97Seconds 358.58Kbytes/sec.
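On a Linux or UNIX client, the same session can be scripted instead of typed. The following is a sketch under stated assumptions: it presumes a Unix-style command-line ftp client that supports the -n option, the function name ftp_upload_commands is hypothetical, and the file path is a made-up example. The function only prints the commands, so you can review them before sending anything.

```shell
#!/bin/sh
# Print the FTP commands for uploading one diagnostics ZIP file.
# Arguments: SR directory name (for example SR12345), local ZIP file path.
# The empty quoted password corresponds to leaving the password blank;
# some clients may prompt for it instead.
# Paths that contain spaces are not handled by this sketch.
ftp_upload_commands() {
    sr_dir=$1
    zip_file=$2
    cat <<EOF
user anonymous ""
cd bda/incoming
mkdir $sr_dir
cd $sr_dir
binary
put $zip_file $(basename "$zip_file")
bye
EOF
}

# Preview the commands without connecting. The file name is hypothetical.
ftp_upload_commands SR12345 /tmp/host-statistics.zip

# To run them (the -n option suppresses auto-login so the "user" line is used):
#   ftp_upload_commands SR12345 /tmp/host-statistics.zip | ftp -n ftp.oracle.com
```

The binary command is issued before put for the reason given in step 4: a ZIP file transferred in ASCII mode can be corrupted.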

2.2 Using Hadoop Monitoring Utilities

Users can monitor MapReduce jobs without providing a Cloudera Manager user name and password.

2.2.1 Monitoring the JobTracker

Hadoop Map/Reduce Administration monitors the JobTracker, which runs on port 50030 of node03 on Oracle Big Data Appliance.

To monitor the JobTracker: 

  • Open a browser and enter a URL like the following:

    http://bda1node03.example.com:50030
    

    In this example, bda1 is the name of the appliance, node03 is the name of the server, and 50030 is the default port number for Hadoop Map/Reduce Administration.

Figure 2-2 shows part of a Hadoop Map/Reduce Administration display.

Figure 2-2 Hadoop Map/Reduce Administration


2.2.2 Monitoring the TaskTracker

The Task Tracker Status interface is available on port 50060 of node04 to node18 on Oracle Big Data Appliance.

To monitor the TaskTracker: 

  • Open a browser and enter a URL like the following:

    http://bda1node13.example.com:50060
    

    In this example, bda1 is the name of the rack, node13 is the name of the server, and 50060 is the default port number for Task Tracker Status.
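Because the TaskTracker runs on fifteen servers, it can be convenient to generate the full list of URLs rather than type each one. This is a minimal sketch; the function name tasktracker_urls is hypothetical, and bda1 and example.com are the placeholder names from the example above.

```shell
#!/bin/sh
# Print the Task Tracker Status URL for each of node04 through node18.
# Arguments: appliance name, domain.
tasktracker_urls() {
    appliance=$1
    domain=$2
    n=4
    while [ "$n" -le 18 ]; do
        # %02d pads single-digit node numbers with a leading zero (node04).
        printf 'http://%snode%02d.%s:50060\n' "$appliance" "$n" "$domain"
        n=$((n + 1))
    done
}

tasktracker_urls bda1 example.com
```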

Figure 2-3 shows the TaskTracker.

Figure 2-3 Task Tracker Status


2.3 Providing Remote Client Access to CDH

Oracle Big Data Appliance supports full local access to all commands and utilities in Cloudera's Distribution including Apache Hadoop (CDH).

You can use a browser on any computer on the same network as Oracle Big Data Appliance to access Cloudera Manager, Hadoop Map/Reduce Administration, Hadoop Task Tracker UI, and other browser-based Hadoop tools.

To issue Hadoop commands remotely, however, you must connect from a system configured as a CDH client. This section explains how to set up a computer so you can access HDFS and submit MapReduce jobs on Oracle Big Data Appliance.

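To follow these procedures, you must have these access privileges: root access to the client system, a user account on an Oracle Big Data Appliance server for copying files, and login access to Cloudera Manager.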

If you do not have these access privileges, then contact your system administrator for help.

2.3.1 Installing CDH on the Client System

The system that you use to access Oracle Big Data Appliance must run Oracle Linux 5 or a compatible Linux distribution, that is, one that permits installation of Oracle Linux 5 RPMs. You must install the same version of CDH that Oracle Big Data Appliance runs, or CDH3u4 or later.

To install the CDH client software: 

  1. Log in to the Linux system as root and change to the /tmp directory.

    cd /tmp
    
  2. Perform a secure copy of the Hadoop client RPM to the /tmp directory:

    scp username@bda_node_name:/opt/hadoop/client/*.rpm .
    

    Or, to use sftp instead of scp:

    1. Open a secure connection to any server in Oracle Big Data Appliance:

      sftp username@bda_node_name
      
    2. Copy the RPM file:

      get /opt/hadoop/client/*.rpm
      
    3. Close the SFTP connection:

      quit
      
  3. Ensure that no Hadoop client currently exists on your system:

    rpm -qa | grep hadoop
    

    If you see just the prompt, then no Hadoop client is installed, and you can continue with the next step.

    If the command returns a value, then remove that version:

    rpm -e hadoop-version
    
  4. Install the new CDH client:

    rpm -ihv hadoop-version
    

Example 2-2 illustrates the previous steps. It uses scp to copy hadoop-0.20-0.20.2+923.202-1.noarch.rpm from bda1node09, removes an older version of Hadoop, and installs the new version.

Example 2-2 Installing the CDH Client Software

[root@client]$ cd /tmp
[root@client]$ scp username@bda1node09.example.com:/opt/hadoop/client/*rpm .
username@bda1node09.example.com's password:
hadoop-0.20-0.20.2+923.202-1.noarch.rpm 100% 30MB 10.0MB/s 00:03 
[root@client]$ rpm -qa | grep hadoop
hadoop-0.20-0.20.2+923.194-1
[root@client]$ rpm -e hadoop-0.20-0.20.2+923.194-1 
[root@client]$ rpm -ihv hadoop-0.20-0.20.2+923.202-1.noarch.rpm
warning: hadoop-0.20-0.20.2+923.202-1.noarch.rpm: Header V4 DSA signature: NOKEY, key ID e8f86acd
Preparing...                ########################################### [100%]
   1:hadoop-0.20            ########################################### [100%]
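Step 3 of the procedure can be turned into a small review aid that lists the removal commands without running them. The sketch below is an illustration only: the function name hadoop_removal_commands is hypothetical, and the non-Hadoop package name in the sample input is made up. The Hadoop package name is the one shown in Example 2-2.

```shell
#!/bin/sh
# Read a package list (one name per line, as printed by "rpm -qa") on
# standard input and print an "rpm -e" command for each Hadoop package.
# Nothing is removed; the commands are only printed for review.
hadoop_removal_commands() {
    grep hadoop | while read -r pkg; do
        printf 'rpm -e %s\n' "$pkg"
    done
}

# Sample input. On a real client, use: rpm -qa | hadoop_removal_commands
printf '%s\n' some-other-package-1.0 hadoop-0.20-0.20.2+923.194-1 |
    hadoop_removal_commands
```

If the function prints nothing, no Hadoop client is installed and you can proceed directly to the installation step.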

2.3.2 Configuring CDH

After installing CDH, you must configure it for use with Oracle Big Data Appliance.

To configure the Hadoop client: 

  1. Open a browser on your client system and connect to Cloudera Manager. It runs on node02 and listens on port 7180, as shown in this example:

    http://bda1node02.example.com:7180
    
  2. Log in as admin.

  3. Cloudera Manager opens on the Services tab. Click the Generate Client Configuration button.

  4. On the Command Details page (shown in Figure 2-4), click Download Result Data to download global-clientconfig.zip.

  5. Unzip global-clientconfig.zip into the /tmp directory on the client system. It creates a hadoop-conf directory containing these files:

    core-site.xml
    hadoop-env.sh
    hdfs-site.xml
    log4j.properties
    mapred-site.xml
    README.txt
    ssl-client.xml.example
    
  6. Open hadoop-env.sh in a text editor and change JAVA_HOME to the correct location on your system:

    export JAVA_HOME=full_directory_path
    
  7. Delete the hash mark (#) to uncomment the line, then save the file.

  8. Copy the configuration files to the Hadoop conf directory:

    cd /tmp/hadoop-conf
    cp * /usr/lib/hadoop/conf/
    
  9. Validate the installation by changing to the mapred user and submitting a MapReduce job, such as the one shown here:

    su mapred
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 10 1000000
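Steps 6 and 7 can also be done without a text editor. The following sketch assumes that hadoop-env.sh contains a commented-out line beginning with "# export JAVA_HOME="; check your copy of the file first, because its exact contents are not shown in this guide. The function name set_java_home and the path /usr/java/default are hypothetical.

```shell
#!/bin/sh
# Uncomment the JAVA_HOME line in a hadoop-env.sh file and point it at the
# given directory. The original file is kept with a .bak suffix.
# The directory must not contain the characters | or &.
set_java_home() {
    env_file=$1
    java_home=$2
    cp "$env_file" "$env_file.bak"
    sed "s|^#* *export JAVA_HOME=.*|export JAVA_HOME=$java_home|" \
        "$env_file.bak" > "$env_file"
}

# Demonstration on a scratch file with assumed contents.
demo=$(mktemp)
printf '%s\n' '# The java implementation to use.' \
    '# export JAVA_HOME=/usr/lib/j2sdk1.6-sun' > "$demo"
set_java_home "$demo" /usr/java/default
grep JAVA_HOME "$demo"
rm -f "$demo" "$demo.bak"
```

On the client, the equivalent call would be set_java_home /tmp/hadoop-conf/hadoop-env.sh followed by the JDK directory on your system, run before the files are copied in step 8.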
    

Figure 2-4 shows the download page for the client configuration.

Figure 2-4 Cloudera Manager Command Details: GenerateClient Page


2.4 Managing User Accounts

Every open source package installed on Oracle Big Data Appliance creates one or more users and groups. Most of these users do not have login privileges, shells, or home directories. They are used by daemons and are not intended as an interface for individual users. For example, Hadoop operates as the hdfs user, MapReduce operates as mapred, and Hive operates as hive. Table 2-1 identifies the operating system users and groups that are created automatically during installation of Oracle Big Data Appliance Software for use by CDH components and other software packages.

You can use the oracle identity to run Hadoop and Hive jobs immediately after the Oracle Big Data Appliance software is installed. This user account has login privileges, a shell, and a home directory. Oracle NoSQL Database and Oracle Data Integrator run as the oracle user. Its primary group is oinstall.

Note:

Do not delete or modify the users created during installation, because they are required for the software to operate.

When creating additional user accounts, define them as follows:

Table 2-1 Operating System Users and Groups

| User Name | Group | Used By | Login Rights |
|---|---|---|---|
| flume | flume | Flume parent and nodes | No |
| hbase | hbase | HBase processes | No |
| hdfs | hadoop | NameNode, DataNode | No |
| hive | hive | Hive metastore and server processes | No |
| hue | hue | Hue processes | No |
| mapred | hadoop | JobTracker, TaskTracker, Hive Thrift daemon | Yes |
| mysql | mysql | MySQL Server | Yes |
| oozie | oozie | Oozie server | No |
| oracle | dba, oinstall | Oracle NoSQL Database, Oracle DBA, Oracle Loader for Hadoop, Oracle Data Integrator | Yes |
| puppet | puppet | Puppet parent (puppet nodes run as root) | No |
| sqoop | sqoop | Sqoop metastore | No |
| svctag | -- | Auto Service Request | No |
| zookeeper | zookeeper | Zookeeper processes | No |


2.5 Software Layout

The following sections identify the software installed on Oracle Big Data Appliance and where it runs in the rack. Some components operate with Oracle Database 11.2.0.2 and later releases.

2.5.1 Software Components

These software components are installed on all 18 servers in Oracle Big Data Appliance Rack. Oracle Linux, required drivers, firmware, and hardware verification utilities are factory installed. All other software is installed on site using the Mammoth Utility.

Note:

You do not need to install software on Oracle Big Data Appliance. Doing so may result in a loss of warranty and support. See the Oracle Big Data Appliance Owner's Guide.

Installed software: 

  • Oracle Linux 5.6

  • Java HotSpot Virtual Machine 6 Update 29

  • Cloudera's Distribution including Apache Hadoop Release 3 Update 3 (CDH)

  • Cloudera Manager 3.7

  • Oracle Loader for Hadoop 1.1

  • Oracle NoSQL Database Community Edition 11g Release 1.2.123

  • Oracle Data Integrator Agent 1.1.1.6

  • Oracle R Connector for Hadoop 1.0

  • R distribution 2.13.2

  • Oracle Direct Connector for Hadoop Distributed File System 1.0

  • Oracle Instant Client 11.2.0.3

  • MySQL Database SE 5.5.18

See Also:

Oracle Big Data Appliance Owner's Guide for information about the Mammoth Utility.

Figure 2-5 shows the relationships among the major components.

Figure 2-5 Major Software Components of Oracle Big Data Appliance


2.5.2 Logical Disk Layout

Each server has 12 disks. The critical information is stored on disks 1 and 2.

Table 2-2 describes how the disks are partitioned.

Table 2-2 Logical Disk Layout

| Disk | Description |
|---|---|
| 1 to 2 | 150 GB mirrored, physical and logical partition with the Linux operating system, all installed software, NameNode data, and MySQL data, for a total of four copies. 2.8 TB HDFS data partition. |
| 3 | Single Oracle NoSQL Database partition, if activated during software installation; otherwise, a single HDFS data partition |
| 4 to 12 | Single HDFS data partition |


2.6 Software Services

This section identifies the services, where they run, and their default status. Services that are always on are required for normal operation. Services that you can switch on and off are optional.

You can use Cloudera Manager to view the services.

To view the services: 

  1. In Cloudera Manager, click the Services tab at the top of the page to display the Services page.

  2. Click the name of a service to see its detail pages. The service opens on the Status page.

  3. Click the link to the page you want to view: Status, Instances, Commands, Configuration, or Audits.

2.6.1 Parent Services

Table 2-3 describes the parent services and those that run without child services.

Table 2-3 Parent Services

| Service | Role | Description | Default Status |
|---|---|---|---|
| hbase | -- | HBase database | OFF |
| hdfs1 | NameNode | Tracks all files stored in the cluster. | Always ON |
| hdfs1 | Secondary NameNode | Tracks information for the NameNode | Always ON |
| hdfs1 | Balancer | Periodically issues the balancer command; although the balancer service is enabled, it does not run all the time | Always ON |
| hive | -- | Hive data warehouse for Hadoop | Always ON |
| hue1 | Hue Server | GUI for HDFS, MapReduce, and Hive, with shells for Pig, Flume, and HBase | Always ON |
| mapreduce1 | JobTracker | Used by MapReduce | Always ON |
| mgmt1 | all | Cloudera Manager | Always ON |
| MySQL | -- | MySQL Master Database | ON |
| ODI Agent | -- | Oracle Data Integrator agent, installed on same node as MySQL Database | ON |
| oozie | -- | Workflow and coordination service for Hadoop | OFF |
| ZooKeeper | -- | ZooKeeper coordination service | OFF |


2.6.2 Child Services

Table 2-4 describes the child services.

Table 2-4 Child Services

| Service | Role | Description | Default Status |
|---|---|---|---|
| HBase Region Server | -- | Hosts data and processes requests for HBase | OFF |
| hdfs1 | DataNode | Stores data in HDFS | Always ON |
| mapreduce1 | TaskTracker | Accepts tasks from the JobTracker | Always ON |
| NoSQL DB Storage Node | -- | Supports Oracle NoSQL Database | ON |
| nosqldb | -- | Supports a web console or command-line interface for administering Oracle NoSQL Database | ON |


2.6.3 Software Services Distribution

All services are installed on all servers, but individual services run only on designated servers.

2.6.3.1 Service Locations

Table 2-5 identifies the nodes where the services run. Services cannot be run on different nodes in this release, so do not attempt to change this configuration.

Table 2-5 Software Service Locations

| Service | Node |
|---|---|
| Balancer | Node01 |
| Beeswax Server | Node03 |
| Cloudera Manager Agents | All nodes |
| Cloudera Manager SCM Server | Node02 |
| Datanode | All nodes |
| Hive Server | Node03 |
| Hue Server | Node03 |
| JobTracker | Node03 |
| MySQL Backup | Node02 |
| MySQL Primary Server | Node03 |
| NameNode | Node01 |
| Oracle Data Integrator Agent (1) | Node03 |
| Oracle NoSQL Database Administration (1) | Node02 |
| Oracle NoSQL Database Server Processes (1) | All nodes |
| Puppet Agents | All nodes |
| Puppet Master | Node01 |
| Secondary NameNode | Node02 |
| TaskTracker | Node04 to Node18 |

(1) Started only if requested in the Oracle Big Data Appliance Configuration Worksheets

2.6.3.2 NameNode

The NameNode is the most critical process because it keeps track of the location of all data. Without a healthy NameNode, the entire cluster fails. This vulnerability is intrinsic to Apache Hadoop (v0.20.2 and earlier).

Oracle protects against catastrophic failure by maintaining four copies of the NameNode logs:

  • Node01: Working copy of the NameNode snapshot and update logs is stored in /opt/hadoop/dfs/ and is automatically mirrored in a local Linux partition.

  • Node02: Backup copy of the logs is stored in /opt/shareddir/ and is also automatically mirrored in a local Linux partition.

A fifth backup outside of Oracle Big Data Appliance can be configured during the software installation.

Note:

The Secondary NameNode is not a backup of the primary NameNode and does not provide failover. The Secondary NameNode performs memory-intensive functions for the primary NameNode.

2.6.3.3 Unconfigured Software

The following tools are installed but not configured. Before using them, you must configure them for your use.

  • Flume

  • Mahout

  • Oozie

  • Sqoop

  • Whirr

See Also:

CDH3 Installation and Configuration Guide for configuration procedures at

http://oracle.cloudera.com

2.7 Effects of Hardware on Software Availability

The effects of a server failure vary depending on the server's function within the CDH cluster. Sun Fire servers are more robust than commodity hardware, so you should experience fewer hardware failures. This section highlights the most important services that run on the various servers. For a full list, see "Service Locations".

2.7.1 Node01: Critical for All Services

Node01 is critically important because it is where the NameNode runs. If this server fails, the effect is downtime for the entire cluster, because the NameNode keeps track of the data locations. However, there are always four copies of the NameNode metadata on Oracle Big Data Appliance, plus an optional NFS backup.

The current state and update logs are written to these locations:

  • Node01: /opt/hadoop/dfs/ on Disk 1 is the working copy, and a local, operating-system mirrored partition on Disk 2 provides a second copy.

  • Node04: /opt/shareddir/ on Disk 1 is the third copy, which is also duplicated on a mirrored partition on Disk 2.

2.7.2 Node02 to Node03: Required for Some Services

The cluster continues to function after a loss of node02 or node03, but with a loss of some services that might be critical to your operation. The disruptions are in these areas:

Node02: 

  • Cloudera Manager: This tool provides central management for the entire CDH cluster. Without this tool, you can still monitor activities using the utilities described in "Using Hadoop Monitoring Utilities".

  • Oracle NoSQL Database: This database is an optional component of Oracle Big Data Appliance, so the extent of the disruption depends on whether you are using it and how critical it is to your applications.

Node03: 

  • Oracle Data Integrator: This service supports Oracle Data Integrator Application Adapter for Hadoop. You cannot use this connector when node03 is down.

  • MySQL Master Database: Cloudera Manager, Oracle Data Integrator, Hive, and Oozie use MySQL Database. The data is replicated automatically, but you cannot access it when the master database server, which runs on node03, is down.

  • JobTracker: Assigns MapReduce tasks to specific nodes in the CDH cluster.

2.7.3 Node04 to Node18: Optional for All Services

Node04 to node18 are optional in that Oracle Big Data Appliance continues to operate with no loss of service if a failure occurs. The NameNode automatically replicates the lost data to maintain three copies at all times. MapReduce jobs execute on copies of the data stored elsewhere in the cluster. The only loss is in computational power, because there are fewer servers on which to distribute the work.

Node04 stores two duplicate copies of the critical NameNode data, but a loss of this backup does not affect operation of the NameNode.
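The three failure categories described in this section can be summarized as a lookup by node number, which might be useful in a monitoring or alerting script. This is a sketch only: the function name failure_impact and the wording of the messages are hypothetical, and the classification simply restates sections 2.7.1 to 2.7.3.

```shell
#!/bin/sh
# Classify the impact of losing a server, following sections 2.7.1 to 2.7.3.
# Argument: node number from 1 to 18, with or without a leading zero.
failure_impact() {
    node=${1#0}   # accept 04 as well as 4
    case $node in
        1) echo "critical: the NameNode is down, so the entire cluster is unavailable" ;;
        2) echo "degraded: Cloudera Manager and Oracle NoSQL Database are affected" ;;
        3) echo "degraded: JobTracker, MySQL master database, and Oracle Data Integrator are affected" ;;
        [4-9]|1[0-8]) echo "reduced capacity: no loss of service" ;;
        *) echo "unknown node: $1" >&2
           return 1 ;;
    esac
}

failure_impact 13
```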

2.8 Security on Oracle Big Data Appliance

This section identifies security vulnerabilities and discusses the precautions you can take to prevent unauthorized use of the software and data on Oracle Big Data Appliance.

2.8.1 CDH Security

Apache Hadoop is not an inherently secure system. It is protected only by network security. After a connection is established, a client has full access to the system.

Cloudera's Distribution including Apache Hadoop (CDH) supports the Kerberos network authentication protocol to prevent malicious impersonation. You must install and configure Kerberos and set up a Kerberos Key Distribution Center and realm. Then you configure various components of CDH to use Kerberos.

CDH provides these security features when configured to use Kerberos:

  • The CDH master nodes, NameNode and JobTracker, resolve the group name so that users cannot manipulate their group memberships.

  • Map tasks run under the identity of the user who submitted the job.

  • Authorization mechanisms in HDFS and MapReduce help control user access to data.

See Also:

http://oracle.cloudera.com for these manuals:

  • CDH3 Security Guide

  • Configuring Hadoop Security with Cloudera Manager

  • Configuring TLS Security for Cloudera Manager

2.8.2 Port Numbers Used on Oracle Big Data Appliance

Table 2-6 identifies the port numbers that may be used in addition to those used by CDH. For the full list of CDH port numbers, go to the Cloudera website at

http://ccp.cloudera.com/display/CDHDOC/Configuring+Ports+for+CDH3

To view the ports used on a particular server: 

  1. In Cloudera Manager, click the Hosts tab at the top of the page to display the Hosts page.

  2. In the Name column, click a server link to see its detail page.

  3. Scroll down to the Ports section.

See Also:

The Cloudera website for CDH port numbers: http://ccp.cloudera.com/display/CDHDOC/Configuring+Ports+for+CDH3

Table 2-6 Oracle Big Data Appliance Port Numbers

| Service | Port |
|---|---|
| Automated Service Monitor (ASM) | 30920 |
| MySQL Database | 3306 |
| Oracle Data Integrator Agent | 20910 |
| Oracle NoSQL Database administration | 5001 |
| Oracle NoSQL Database processes | 5010 to 5020 |
| Oracle NoSQL Database registration | 5000 |
| Port map | 111 |
| Puppet master service | 8140 |
| Puppet node service | 8139 |
| rpc.statd | 668 |
| ssh | 22 |
| xinetd (service tag) | 6481 |
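When reviewing firewall rules or checking which services are listening, it can help to have the single-port entries of Table 2-6 in a form that scripts can query. The following is a minimal sketch; the function name bda_port and the service keywords are hypothetical labels chosen for this illustration, and only the port numbers come from the table. The Oracle NoSQL Database process range (5010 to 5020) is omitted because it is not a single port.

```shell
#!/bin/sh
# Print the Table 2-6 port for a service keyword; return 1 if unknown.
bda_port() {
    case $1 in
        asm)            echo 30920 ;;
        mysql)          echo 3306 ;;
        odi-agent)      echo 20910 ;;
        nosql-admin)    echo 5001 ;;
        nosql-register) echo 5000 ;;
        portmap)        echo 111 ;;
        puppet-master)  echo 8140 ;;
        puppet-node)    echo 8139 ;;
        rpc.statd)      echo 668 ;;
        ssh)            echo 22 ;;
        xinetd)         echo 6481 ;;
        *)              return 1 ;;
    esac
}

bda_port puppet-master
```

On a server, the result can be combined with a tool such as netstat, for example netstat -tln | grep ":$(bda_port mysql) ", to see whether anything is listening on that port.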


2.8.3 Security of Software Components

Following are configuration details about the software components and any special security precautions they require.

2.8.3.1 Puppet

The puppet node service (puppetd) runs continuously as root on all servers. It listens on port 8139 for "kick" requests, which trigger it to request updates from the puppet master. It does not receive updates on this port.

The puppet master service (puppetmasterd) runs continuously as the puppet user on the first server of the primary Oracle Big Data Appliance rack. It listens on port 8140 for requests to push updates to puppet nodes.

The puppet nodes generate and send certificates to the puppet master to register initially during installation of the software. For updates to the software, the puppet master signals ("kicks") the puppet nodes, which then request all configuration changes from the puppet master node that they are registered with.

The puppet master sends updates only to puppet nodes that have known, valid certificates. Puppet nodes accept updates only from the puppet master host name they initially registered with. Because Oracle Big Data Appliance uses an internal network for communication within the rack, the puppet master host name resolves using /etc/hosts to an internal, private IP address.
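One way to confirm this name resolution is to look up the puppet master host name in /etc/hosts and check that the address falls in a private range. The sketch below makes several assumptions: the internal network uses an RFC 1918 IPv4 range, the function names are hypothetical, and the host name and address in the demonstration are made up rather than taken from an actual appliance.

```shell
#!/bin/sh
# Print the address that an /etc/hosts style file assigns to a host name.
# Arguments: hosts file, host name. Prints nothing if the name is not listed.
hosts_lookup() {
    awk -v name="$2" '
        /^[ \t]*#/ { next }   # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$1"
}

# Succeed if the argument is an IPv4 address in an RFC 1918 private range.
is_private_ipv4() {
    case $1 in
        10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

# Demonstration on a scratch file; the host name and address are hypothetical.
demo_hosts=$(mktemp)
printf '%s\n' '127.0.0.1 localhost' '192.168.10.1 bda1node01-master' > "$demo_hosts"
addr=$(hosts_lookup "$demo_hosts" bda1node01-master)
if is_private_ipv4 "$addr"; then
    echo "$addr is a private address"
fi
rm -f "$demo_hosts"
```

On an appliance server, you would pass /etc/hosts and the actual puppet master host name instead of the scratch file.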