Oracle® Big Data Appliance Owner's Guide
Release 1 (1.0.3)
Part Number E25960-05

14 Maintaining Oracle Big Data Appliance

This chapter describes how to monitor and maintain Oracle Big Data Appliance. Some of these procedures use the dcli utility to execute commands in parallel on all servers.

This chapter contains the following sections:

  • Monitoring the Ambient Temperature of Servers

  • Powering On and Off Oracle Big Data Appliance

  • Adding Memory to a Server

  • Maintaining the Physical Disks of Servers

  • Changing InfiniBand IP Addresses

  • Maintaining the InfiniBand Network

  • Changing the NTP Servers

See Also:

Chapter 7, "Using the dcli Utility"

14.1 Monitoring the Ambient Temperature of Servers

Maintaining environmental temperature conditions within design specification for a Sun Fire server helps to achieve maximum efficiency and targeted component service lifetimes. The effect of validating the ambient temperature range is minimal. The effect of corrective actions varies depending on the environmental conditions.

Temperatures outside the ambient temperature range of 21 to 23 degrees Celsius (70 to 74 degrees Fahrenheit) affect all components within Oracle Big Data Appliance, possibly causing performance problems and shortened service lifetimes.

To monitor the ambient temperature: 

  1. Connect to an Oracle Big Data Appliance server as root.

  2. Set up passwordless SSH for root by issuing the setup-root-ssh command, as described in "Setting Up Passwordless SSH".

  3. Check the current temperature:

    dcli 'ipmitool sunoem cli "show /SYS/T_AMB" | grep value'
    
  4. If any temperature reading is outside the operating range, then investigate and correct the problem. See Table 2-9.

The following is an example of the command output:

bda1node01-adm.example.com: value = 22.000 degree C
bda1node02-adm.example.com: value = 22.000 degree C
bda1node03-adm.example.com: value = 22.000 degree C
bda1node04-adm.example.com: value = 23.000 degree C
          .
          .
          .
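
To flag readings outside the recommended range automatically, you can filter the command output with awk. This is a convenience sketch, not part of the standard Oracle Big Data Appliance tooling; it assumes the temperature appears in the fourth field, as in the sample output above:

dcli 'ipmitool sunoem cli "show /SYS/T_AMB" | grep value' | \
awk '$4+0 < 21 || $4+0 > 23 {print $0 " <== outside 21-23 degrees C"}'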

14.2 Powering On and Off Oracle Big Data Appliance

This section includes the following topics:

  • Nonemergency Power Procedures

  • Emergency Power-Off Considerations

  • Cautions and Warnings

14.2.1 Nonemergency Power Procedures

This section contains the procedures for powering on and off the components of Oracle Big Data Appliance in an orderly fashion. It contains the following topics:

  • Powering On Oracle Big Data Appliance

  • Powering On Servers Remotely Using Oracle ILOM

  • Powering Off Oracle Big Data Appliance

  • Powering On and Off Network Switches

14.2.1.1 Powering On Oracle Big Data Appliance

Oracle Big Data Appliance is powered on by either pressing the power button on the front of the servers, or by logging in to the Oracle ILOM interface and applying power to the system.

To power on Oracle Big Data Appliance: 

  1. Turn on all 12 breakers on both PDUs.

    Allow 1 to 2 minutes for Oracle ILOM to start.

  2. Power up the servers.

14.2.1.2 Powering On Servers Remotely Using Oracle ILOM

You can power on the servers remotely using the Oracle ILOM interface. You can access Oracle ILOM using the web console, the command-line interface (CLI), the intelligent platform management interface (IPMI), or the simple network management protocol interface (SNMP). For example, to apply power to server bda1node01 using IPMI, run the following command as root from a server that has ipmitool installed:

ipmitool -H bda1node01-c -U root chassis power on

In this example, bda1node01-c is the host name of Oracle ILOM for the server to be powered on. You are prompted for the password.
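
To power on all servers in a rack the same way, you can loop over the Oracle ILOM host names. The following is a sketch only; it assumes the standard bda1nodeNN-c naming shown above and a full rack of 18 servers, and ipmitool prompts for the root password on each host:

for n in $(seq -w 1 18); do
    ipmitool -H bda1node${n}-c -U root chassis power on
done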

See Also:

Oracle Integrated Lights Out Manager 3.0 documentation for information about using Oracle ILOM to power on the servers:

http://docs.oracle.com/cd/E19860-01/index.html

14.2.1.3 Powering Off Oracle Big Data Appliance

To power off Oracle Big Data Appliance:

  1. Power off the servers.

  2. Turn off all 12 breakers on both PDUs.

14.2.1.3.1 Powering Off the Servers

Use the Linux shutdown command to power off or restart the servers. Issue this command as root to shut down a server immediately:

# shutdown -hP now

The following command restarts a server immediately:

# shutdown -r now

See Also:

Linux shutdown manual page for details

14.2.1.3.2 Powering Off Multiple Servers at the Same Time

Use the dcli utility to run the shutdown command on multiple servers at the same time. Do not run the dcli utility from a server that will be shut down. Set up passwordless SSH for root, as described in "Setting Up Passwordless SSH".

The following shows the syntax of the command:

dcli -l root -g group_name shutdown -hP now

In this command, group_name is a file that contains a list of servers.

The following example shuts down all Oracle Big Data Appliance servers listed in the server_group file:

dcli -l root -g server_group shutdown -hP now
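
Ensure that the server running dcli is not listed in the group file, or the SSH session is lost in mid-shutdown. The following sketch derives such a file from a hypothetical all_servers file that lists every node, assuming the local host name appears in it exactly as reported by hostname:

grep -v "^$(hostname)$" all_servers > server_group
dcli -l root -g server_group shutdown -hP now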

14.2.1.4 Powering On and Off Network Switches

The network switches do not have power switches. They power off when power is removed by turning off a PDU or a breaker in the data center.

14.2.2 Emergency Power-Off Considerations

In an emergency, halt power to Oracle Big Data Appliance immediately. The following emergencies may require powering off Oracle Big Data Appliance:

  • Natural disasters such as earthquake, flood, hurricane, tornado, or cyclone

  • Abnormal noise, smell, or smoke coming from the system

  • Threat to human safety

14.2.2.1 Emergency Power-Off Procedure

To perform an emergency power-off procedure for Oracle Big Data Appliance, turn off power at the circuit breaker or pull the emergency power-off switch in the computer room. After the emergency, contact Oracle Support Services to restore power to the system.

14.2.2.2 Emergency Power-Off Switch

Emergency power-off (EPO) switches are required when computer equipment contains batteries capable of supplying more than 750 volt-amperes for more than 5 minutes. Systems that have these batteries include internal EPO hardware for connection to a site EPO switch or relay. Use of the EPO switch removes power from Oracle Big Data Appliance.

14.2.3 Cautions and Warnings

The following cautions and warnings apply to Oracle Big Data Appliance:

  • Do not touch the parts of this product that use high-voltage power. Touching them might result in serious injury.

  • Do not power off Oracle Big Data Appliance unless there is an emergency. In that case, follow the "Emergency Power-Off Procedure".

  • Keep the front and rear cabinet doors closed. Failure to do so might cause system failure or result in damage to hardware components.

  • Keep the top, front, and back of the cabinets clear to allow proper airflow and prevent overheating of components.

  • Use only the supplied hardware.

14.3 Adding Memory to a Server

Oracle Big Data Appliance ships from the factory with 48 GB of memory in each server. Six of the 18 DIMM slots are populated with 8 GB DIMMs. You can populate the empty slots with 8 GB DIMMs to bring the total memory to either 96 GB (12 x 8 GB) or 144 GB (18 x 8 GB). An upgrade to 144 GB may slightly reduce performance because the memory frequency drops from 1333 MHz to 800 MHz, lowering memory bandwidth.

You can add memory to all nodes in the cluster or just to specific nodes that need more memory, such as the NameNode (server 01).

To add memory to a Sun Fire server:  

  1. Power down the server.

  2. Replace the plastic fillers with the new DIMMs, as described in the Sun Fire X4270 M2 Server Service Manual at http://docs.oracle.com/cd/E19245-01.

  3. Power on the server.

14.4 Maintaining the Physical Disks of Servers

Repair of the physical disks does not require shutting down Oracle Big Data Appliance. However, an individual server may need to be taken out of the cluster temporarily, which requires downtime for that server.

See Also:

"Parts for Sun Fire Servers" for the repair procedures

14.4.1 Verifying the Server Configuration

The 12 disk drives in each Oracle Big Data Appliance server are controlled by an LSI MegaRAID SAS 9261-8i disk controller. Oracle recommends verifying the status of the RAID devices to avoid possible performance impact or an outage. The effect of validating the RAID devices is minimal. The effect of corrective actions varies depending on the specific issue uncovered, and may range from a simple reconfiguration to an outage.

14.4.1.1 Verifying Disk Controller Configuration

Issue this command to verify the disk controller configuration:

# MegaCli64 -AdpAllInfo -a0 | grep "Device Present" -A 8

The following is an example of the output from the command. There should be 12 virtual drives, no degraded or offline drives, and 14 physical devices. The 14 devices are the controllers and the 12 disk drives.

Device Present
                ================
Virtual Drives    : 12 
  Degraded        : 0 
  Offline         : 0 
Physical Devices  : 14 
  Disks           : 12 
  Critical Disks  : 0 
  Failed Disks    : 0 

If the output is different, then investigate and correct the problem.
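
To run the same verification across every server at once, you can wrap the command in the dcli utility, with passwordless SSH for root already set up. A sketch:

dcli "MegaCli64 -AdpAllInfo -a0 | grep 'Device Present' -A 8"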

14.4.1.2 Verifying Virtual Drive Configuration

Issue this command to verify the virtual drive configuration:

# MegaCli64 -LDInfo -lAll -a0

The following is an example of the output for Virtual Drive 0. Ensure that the State is Optimal.

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 1.817 TB
Parity Size         : 0
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
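
The full listing repeats this block for each of the 12 virtual drives. As a convenience, you can reduce the output to the state lines only; all 12 should report Optimal. A sketch, with illustrative output:

# MegaCli64 -LDInfo -lAll -a0 | grep "^State"
State               : Optimal
State               : Optimal
     .
     .
     .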

14.4.1.3 Verifying Physical Drive Configuration

Use the following command to verify the physical drive configuration:

# MegaCli64 -PDList -a0 | grep Firmware

The following is an example of the output from the command. Twelve drives should be Online, Spun Up. If the output is different, then investigate and correct the problem.

Firmware state: Online, Spun Up
Device Firmware Level: 061A
Firmware state: Online, Spun Up
Device Firmware Level: 061A
Firmware state: Online, Spun Up
Device Firmware Level: 061A
     .
     .
     .
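
A quick way to confirm all drives at once is to count the matching lines; any result other than 12 warrants investigation. A sketch:

# MegaCli64 -PDList -a0 | grep -c "Firmware state: Online, Spun Up"
12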

14.5 Changing InfiniBand IP Addresses

You may need to change the InfiniBand network information on an existing Oracle Big Data Appliance. For example, the change might support a media server with multiple InfiniBand cards, or keep InfiniBand traffic on a distinct InfiniBand network when production, test, and QA environments share the same rack.

All InfiniBand addresses must be in the same subnet, with a minimum subnet mask of 255.255.240.0 (or /20). Choose a subnet mask wide enough to accommodate possible future expansion of the Oracle Big Data Appliance and InfiniBand network.
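
When selecting new addresses, the ipcalc utility included with Oracle Linux can derive the NETWORK and BROADCAST values required in step 4 of the following procedure. A sketch using an example address in a /20 subnet (the output labels may vary slightly between ipcalc versions):

# ipcalc -bn 192.168.16.10 255.255.240.0
BROADCAST=192.168.31.255
NETWORK=192.168.16.0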

You cannot change the host names after running the Mammoth Utility.

To change the InfiniBand IP addresses: 

  1. Log in to an Oracle Big Data Appliance server as the root user.

  2. Change to the /etc/sysconfig/network-scripts directory.

  3. Copy the ifcfg-bondib0 file, using a name that does not start with ifcfg:

    cp ifcfg-bondib0 orig_ifcfg-bondib0
    
  4. Edit ifcfg-bondib0 to update the IPADDR, NETMASK, NETWORK, and BROADCAST fields.

    The following is an example of an ifcfg-bondib0 file:

    IPADDR=192.166.41.26
    NETMASK=255.255.255.0
    NETWORK=192.166.40.0
    BROADCAST=192.166.47.255
    DEVICE=bondib0
    USERCTL=no
    ONBOOT=yes
    TYPE=bonding
    BONDING_OPTS="mode=active-backup miimon=100 downdelay=5000 updelay=5000 num_grat_arp=100"
    HOTPLUG=no
    IPV6INIT=no
    BOOTPROTO=none
    MTU=65520
    
  5. Restart the server:

    # reboot
    
  6. Verify the new InfiniBand IP address information:

    # iblinkinfo | grep reconfigured_host_name
    

    See "Checking the Health of the Network" for information about the iblinkinfo command.

  7. Make a backup copy of the /etc/hosts file:

    # cp /etc/hosts /etc/orig_hosts
    
  8. Edit the file to change the IP addresses.

  9. Repeat this procedure on all Oracle Big Data Appliance servers.

14.6 Maintaining the InfiniBand Network

The InfiniBand network connects the servers through the bondib0 interface to the InfiniBand switches in the rack. This section describes how to perform maintenance on the InfiniBand switches.

This section contains the following topics:

  • Backing Up and Restoring Oracle ILOM Switch Settings

  • Replacing a Failed InfiniBand Switch

  • Verifying InfiniBand Network Operation

  • Understanding the Network Subnet Manager Master

14.6.1 Backing Up and Restoring Oracle ILOM Switch Settings

Oracle ILOM supports remote administration of the Oracle Big Data Appliance servers. This section explains how to back up and restore the Oracle ILOM configuration settings, which are set by the Mammoth Utility.

This section contains the following topics:

  • Backing Up Oracle ILOM Configuration Settings

  • Restoring Oracle ILOM Configuration Settings

See Also:

Oracle Integrated Lights Out Manager 3.0 documentation at

http://docs.oracle.com/cd/E19860-01/

14.6.1.1 Backing Up Oracle ILOM Configuration Settings

To back up the Oracle ILOM configuration settings:

  1. Open your browser on any system on the same network as Oracle Big Data Appliance and enter the Oracle ILOM address of a server. This example uses the Oracle ILOM address of server 08:

    http://bda1node08-c.example.com

  2. Log in as the ilom-admin user. The default password is welcome1.

  3. Select the Maintenance tab.

  4. Select the Backup/Restore subtab.

  5. Select the Backup operation and the Browser transfer method, as shown in Figure 14-1.

  6. Enter a pass phrase. The phrase is used to encrypt sensitive information, such as passwords, in the backup.

  7. Click Run to start the backup. The results are downloaded to your local system in an XML file named config_backup.xml.

  8. Save the file in a secure location.

  9. Click the Log Out button.

Figure 14-1 Oracle ILOM Configuration Backup

14.6.1.2 Restoring Oracle ILOM Configuration Settings

To restore the settings on a switch:

  1. Open your browser on any system on the same network as Oracle Big Data Appliance and enter the Oracle ILOM address of a server. This example uses the Oracle ILOM address of server 08:

    http://bda1node08-c.example.com

  2. Log in as the ilom-admin user. The default password is welcome1.

  3. Select the Maintenance tab.

  4. Select the Backup/Restore subtab.

  5. Select the Restore operation and the Browser transfer method.

  6. Click Choose File and select the config_backup.xml file saved previously in a backup operation.

  7. Enter the pass phrase that was set during the backup.

  8. Click Run to restore the configuration.

14.6.2 Replacing a Failed InfiniBand Switch

Complete these steps to replace a Sun Network QDR InfiniBand Gateway Switch or a Sun Datacenter InfiniBand Switch 36.

To replace a failed InfiniBand switch: 

  1. Disconnect the cables from the switch. All InfiniBand cables have labels at both ends indicating their locations. If any cables do not have labels, then label them.

  2. Power off both power supplies on the switch by removing the power plugs.

  3. Remove the switch from the rack.

  4. Install the new switch in the rack.

  5. Restore the switch settings using the backup, as described in "Backing Up and Restoring Oracle ILOM Switch Settings".

  6. Connect to the switch as ilom-admin and open the Fabric Management shell:

    -> show /SYS/Fabric_Mgmt
    

    The prompt changes from -> to FabMan@hostname->

  7. Disable the Subnet Manager:

    FabMan@bda1sw-02-> disablesm
    
  8. Connect the cables to the new switch, being careful to connect each cable to the correct port.

  9. Verify that there are no errors on any links in the fabric:

    FabMan@bda1sw-02-> ibdiagnet -c 1000 -r
    
  10. Enable the Subnet Manager:

    FabMan@bda1sw-02-> enablesm
    

    Note:

    If the replaced switch was the Sun Datacenter InfiniBand Switch 36 spine switch, then manually fail the master Subnet Manager back to the switch by disabling the Subnet Managers on the other switches until the spine switch becomes the master, then reenable the Subnet Manager on all the other switches.
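
The following sketch shows that failback sequence, using hypothetical switch names (bda1sw-ib2 and bda1sw-ib3 for the leaf switches, bda1sw-ib1 for the spine switch); run each command in the Fabric Management shell of the indicated switch, and repeat getmaster until it reports the spine switch as the master Subnet Manager before reenabling the others:

FabMan@bda1sw-ib2-> disablesm
FabMan@bda1sw-ib3-> disablesm
FabMan@bda1sw-ib1-> getmaster
FabMan@bda1sw-ib2-> enablesm
FabMan@bda1sw-ib3-> enablesm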

14.6.3 Verifying InfiniBand Network Operation

If any component in the InfiniBand network has required maintenance, such as replacing an InfiniBand Host Channel Adapter (HCA) on a server, an InfiniBand switch, or an InfiniBand cable, or if the InfiniBand network is suspected of performing below expectations, then verify that the InfiniBand network is operating properly. The following procedure describes how to verify network operation:

Note:

Use this procedure any time the InfiniBand network is performing below expectations.

To verify InfiniBand network operation: 

  1. Issue the ibdiagnet command to verify InfiniBand network quality:

    # ibdiagnet -c 1000
    

    Investigate all errors reported by this command. It generates a small amount of network traffic and can run during a normal workload.

  2. Report switch port error counters and port configuration information. The LinkDowned, RcvSwRelayErrors, XmtDiscards, and XmtWait errors are ignored by this command:

    #  ibqueryerrors.pl -rR -s LinkDowned,RcvSwRelayErrors,XmtDiscards,XmtWait
    
  3. Check the status of the hardware:

    # bdacheckhw
    

    The following is an example of the output:

    [SUCCESS: Correct system model : SUN FIRE X4270 M2 SERVER
    [SUCCESS: Correct processor info : Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
    [SUCCESS: Correct number of types of CPU : 1
    [SUCCESS: Correct number of CPU cores : 24
    [SUCCESS: Sufficient GB of memory (>=48): 48
    [SUCCESS: Correct GB of swap space : 24
    [SUCCESS: Correct BIOS vendor : American Megatrends Inc.
    [SUCCESS: Sufficient BIOS version (>=08080102): 08080102
    [SUCCESS: Recent enough BIOS release date (>=05/23/2011) : 05/23/2011
    [SUCCESS: Correct ILOM version : 3.0.16.10.a r68533
    [SUCCESS: Correct number of fans : 6
    [SUCCESS: Correct fan 0 status : ok
    [SUCCESS: Correct fan 1 status : ok
    [SUCCESS: Correct fan 2 status : ok
    [SUCCESS: Correct fan 3 status : ok
    [SUCCESS: Correct fan 4 status : ok
    [SUCCESS: Correct fan 5 status : ok
    [SUCCESS: Correct number of power supplies : 2
    [INFO: Detected Santa Clara Factory, skipping power supply checks
    [SUCCESS: Correct disk controller model : LSI MegaRAID SAS 9261-8i
    [SUCCESS: Correct disk controller firmware version : 12.12.0-0048
    [SUCCESS: Correct disk controller PCI address : 13:00.0
    [SUCCESS: Correct disk controller PCI info : 0104: 1000:0079
    [SUCCESS: Correct disk controller PCIe slot width : x8
    [SUCCESS: Correct disk controller battery type : iBBU08
    [SUCCESS: Correct disk controller battery state : Operational
    [SUCCESS: Correct number of disks : 12
    [SUCCESS: Correct disk 0 model : SEAGATE ST32000SSSUN2.0
    [SUCCESS: Sufficient disk 0 firmware (>=61A): 61A
    [SUCCESS: Correct disk 1 model : SEAGATE ST32000SSSUN2.0
    [SUCCESS: Sufficient disk 1 firmware (>=61A): 61A
              .
              .
              .
    [SUCCESS: Correct disk 10 status : Online, Spun Up No alert
    [SUCCESS: Correct disk 11 status : Online, Spun Up No alert
    [SUCCESS: Correct Host Channel Adapter model : Mellanox Technologies MT26428 ConnectX VPI PCIe 2.0
    [SUCCESS: Correct Host Channel Adapter firmware version : 2.9.1000
    [SUCCESS: Correct Host Channel Adapter PCI address : 0d:00.0
    [SUCCESS: Correct Host Channel Adapter PCI info : 0c06: 15b3:673c
    [SUCCESS: Correct Host Channel Adapter PCIe slot width : x8
    [SUCCESS: Big Data Appliance hardware validation checks succeeded
    
  4. Check the status of the software:

    # bdachecksw
    
    The following is an example of the output:

    [SUCCESS: Correct OS disk sda partition info : 1 ext3 raid 2 ext3 raid 3 linux-swap 4 ext3 primary
    [SUCCESS: Correct OS disk sdb partition info : 1 ext3 raid 2 ext3 raid 3 linux-swap 4 ext3 primary
    [SUCCESS: Correct data disk sdc partition info : 1 ext3 primary
    [SUCCESS: Correct data disk sdd partition info : 1 ext3 primary
    [SUCCESS: Correct data disk sde partition info : 1 ext3 primary
    [SUCCESS: Correct data disk sdf partition info : 1 ext3 primary
    [SUCCESS: Correct data disk sdg partition info : 1 ext3 primary
    [SUCCESS: Correct data disk sdh partition info : 1 ext3 primary
    [SUCCESS: Correct data disk sdi partition info : 1 ext3 primary
    [SUCCESS: Correct data disk sdj partition info : 1 ext3 primary
    [SUCCESS: Correct data disk sdk partition info : 1 ext3 primary
    [SUCCESS: Correct data disk sdl partition info : 1 ext3 primary
    [SUCCESS: Correct software RAID info : /dev/md2 level=raid1 num-devices=2 /dev/md0 level=raid1 num-devices=2
    [SUCCESS: Correct mounted partitions : /dev/md0 /boot ext3 /dev/md2 / ext3 /dev/sda4 /u01 ext4 /dev/sdb4 /u02 ext4 /dev/sdc1 /u03 ext4 /dev/sdd1 /u04 ext4 /dev/sde1 /u05 ext4 /dev/sdf1 /u06 ext4 /dev/sdg1 /u07 ext4 /dev/sdh1 /u08 ext4 /dev/sdi1 /u09 ext4 /dev/sdj1 /u10 ext4 /dev/sdk1 /u11 ext4 /dev/sdl1 /u12 ext4
    [SUCCESS: Correct swap partitions : /dev/sdb3 partition /dev/sda3 partition
    [SUCCESS: Correct Linux kernel version : Linux 2.6.32-200.21.1.el5uek
    [SUCCESS: Correct Java Virtual Machine version : HotSpot(TM) 64-Bit Server 1.6.0_29
    [SUCCESS: Correct puppet version : 2.6.11
    [SUCCESS: Correct MySQL version : 5.5.17
    [SUCCESS: All required programs are accessible in $PATH
    [SUCCESS: All required RPMs are installed and valid
    [SUCCESS: Big Data Appliance software validation checks succeeded
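
To run both checks across every server and see only the lines that need attention, you can combine dcli with a filter. A sketch, assuming passwordless SSH for root is set up:

dcli bdacheckhw | grep -v SUCCESS
dcli bdachecksw | grep -v SUCCESS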
    

14.6.4 Understanding the Network Subnet Manager Master

The Subnet Manager manages all operational characteristics of the InfiniBand network, such as the following:

  • Discover the network topology

  • Assign a local identifier to all ports connected to the network

  • Calculate and program switch forwarding tables

  • Monitor changes in the fabric

The InfiniBand network can have multiple Subnet Managers, but only one Subnet Manager is active at a time. The active Subnet Manager is the Master Subnet Manager. The other Subnet Managers are the Standby Subnet Managers. If a Master Subnet Manager is shut down or fails, then a Standby Subnet Manager automatically becomes the Master Subnet Manager.

Each Subnet Manager has a configurable priority. When multiple Subnet Managers are on the InfiniBand network, the Subnet Manager with the highest priority becomes the master Subnet Manager. On Oracle Big Data Appliance, the Subnet Managers on the leaf switches are configured as priority 5, and the Subnet Managers on the spine switches are configured as priority 8.
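
You can confirm which Subnet Manager is currently the master, and its priority, by running the sminfo command from any server; it is part of the standard InfiniBand diagnostics. The LID, GUID, and activity count in this sample output are illustrative only:

# sminfo
sminfo: sm lid 6 sm guid 0x21283a8516a0a0, activity count 933330 priority 8 state 3 SMINFO_MASTER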

The following guidelines determine where the Subnet Managers run on Oracle Big Data Appliance:

  • Only run the Subnet Managers on the switches in Oracle Big Data Appliance. Running the Subnet Manager on any other device is not supported.

  • When the InfiniBand network consists of one, two, or three racks cabled together, all switches must run the Subnet Manager. The master Subnet Manager runs on a spine switch.

  • When the InfiniBand network consists of four or more racks cabled together, only the spine switches run the Subnet Manager. The Subnet Manager must be disabled on the leaf switches.

14.7 Changing the NTP Servers

The configuration information for Network Time Protocol (NTP) servers can be changed after the initial setup. The following procedure describes how to change the NTP configuration information for InfiniBand switches, Cisco switches, and Sun Fire servers. Oracle recommends changing each server individually.

To update the Oracle Big Data Appliance servers: 

  1. Stop NTP services on the server.

  2. Update the /etc/ntp.conf file with the IP address of the new NTP server, as shown in the sketch after this list.

  3. Repeat these steps for each server.
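
The following is a minimal sketch of steps 1 and 2 on a single server, assuming the stock ntpd service and a single server line in /etc/ntp.conf; new_IPaddress is a placeholder:

# service ntpd stop
# sed -i 's/^server .*/server new_IPaddress/' /etc/ntp.conf
# service ntpd start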

To update the InfiniBand switches: 

  1. Log in to the switch as the ilom-admin user.

  2. Follow the instructions in "Setting the Time Zone on an InfiniBand Switch".

To update the Cisco Ethernet switch: 

  1. Use telnet to connect to the Cisco Ethernet switch.

  2. Delete the current setting:

    # configure terminal
    Enter configuration commands, one per line. End with CNTL/Z.
    (config)# no ntp server current_IPaddress
    
  3. Enter the new IP address:

    # configure terminal
    Enter configuration commands, one per line. End with CNTL/Z.
    (config)# ntp server new_IPaddress
    
  4. Save the current configuration:

    # copy running-config startup-config
    
  5. Exit from the session:

    # exit
    

Restart Oracle Big Data Appliance after changing the servers and switches.