Oracle® Big Data Appliance Owner's Guide
Release 1 (1.0.3)
Part Number E25960-05
This chapter describes how to monitor and maintain Oracle Big Data Appliance. Some of these procedures use the dcli utility to execute commands in parallel on all servers.
See Also:Chapter 7, "Using the dcli Utility"
Maintaining environmental temperature conditions within the design specifications for a Sun Fire server helps to achieve maximum efficiency and targeted component service lifetimes. Validating the ambient temperature range has minimal impact on the system. The impact of corrective actions varies depending on the environmental conditions.
Temperatures outside the ambient temperature range of 21 to 23 degrees Celsius (70 to 74 degrees Fahrenheit) affect all components within Oracle Big Data Appliance, possibly causing performance problems and shortened service lifetimes.
To monitor the ambient temperature:
Connect to an Oracle Big Data Appliance server as root.
Set up passwordless SSH for root by issuing the setup-root-ssh command, as described in "Setting Up Passwordless SSH".
Check the current temperature:
dcli 'ipmitool sunoem cli "show /SYS/T_AMB" | grep value'
If any temperature reading is outside the operating range, then investigate and correct the problem. See Table 2-9.
The following is an example of the command output:
bda1node01-adm.example.com: value = 22.000 degree C
bda1node02-adm.example.com: value = 22.000 degree C
bda1node03-adm.example.com: value = 22.000 degree C
bda1node04-adm.example.com: value = 23.000 degree C
. . .
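With many servers, it can help to pipe the dcli output through a filter that prints only the out-of-range readings. The following is a minimal sketch, not an Oracle-supplied tool: the function name check_ambient is hypothetical, and it assumes each line has the form shown above (host name, colon, "value =", reading). The range defaults to 21 to 23 degrees Celsius.

```bash
# Hypothetical helper: report servers whose ambient temperature is outside
# the range [low, high] (default 21-23 degrees C).
# Assumes input lines such as:
#   bda1node01-adm.example.com: value = 22.000 degree C
# Example: dcli 'ipmitool sunoem cli "show /SYS/T_AMB" | grep value' | check_ambient
check_ambient() {
  local low="${1:-21}" high="${2:-23}"
  awk -v low="$low" -v high="$high" '
    /value =/ {
      host = $1
      sub(/:$/, "", host)                 # strip the colon that dcli appends
      temp = ""
      for (i = 1; i < NF; i++)            # the reading follows the "=" field
        if ($i == "=") { temp = $(i + 1) + 0; break }
      if (temp != "" && (temp < low || temp > high))
        printf "%s %.1f OUT OF RANGE\n", host, temp
    }'
}
```

Servers within range produce no output, so an empty result means every reading is inside the operating range.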
This section contains the procedures for powering on and off the components of Oracle Big Data Appliance in an orderly fashion.
You can power on Oracle Big Data Appliance either by pressing the power button on the front of the servers or by logging in to the Oracle ILOM interface and applying power to the system.
To power on Oracle Big Data Appliance:
Turn on all 12 breakers on both PDUs.
Allow 1 to 2 minutes for Oracle ILOM to start.
Power up the servers.
You can power on the servers remotely using the Oracle ILOM interface. You can access Oracle ILOM using the web console, the command-line interface (CLI), the intelligent platform management interface (IPMI), or the simple network management protocol interface (SNMP). For example, to apply power to server bda1node01 using IPMI, run the following command as root from a server that has ipmitool installed:
ipmitool -H bda1node01-c -U root chassis power on
In this example, bda1node01-c is the host name of Oracle ILOM for the server to be powered on. You are prompted for the password.
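To power on several servers, you can repeat the same ipmitool command in a loop. This is a sketch under two assumptions: the hypothetical function power_on_servers derives each Oracle ILOM host name by appending -c to the server name, following the bda1node01-c example, and ipmitool prompts for the password once per server, as described above.

```bash
# Hypothetical helper: apply power to each named server through Oracle ILOM.
# Assumption: the ILOM host name is the server name plus "-c"
# (bda1node01 -> bda1node01-c), as in the example above.
# Example: power_on_servers bda1node01 bda1node02 bda1node03
power_on_servers() {
  local server
  for server in "$@"; do
    echo "Powering on ${server} via ${server}-c"
    # Same command as above; ipmitool prompts for the ILOM password.
    ipmitool -H "${server}-c" -U root chassis power on
  done
}
```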
See Also:Oracle Integrated Lights Out Manager 3.0 documentation for information about using Oracle ILOM to power on the servers:
To power off Oracle Big Data Appliance:
Power off the servers.
Turn off all 12 breakers on both PDUs.
The following command shuts down a server immediately and powers it off:
# shutdown -hP now
The following command restarts a server immediately:
# shutdown -r now
See Also:Linux SHUTDOWN manual page for details
You can use the dcli utility to run the shutdown command on multiple servers at the same time. Do not run the dcli utility from a server that will be shut down. Set up passwordless SSH for root, as described in "Setting Up Passwordless SSH".
The command has the following syntax:
dcli -l root -g group_name shutdown -hP now
In this command, group_name is a file that contains a list of servers.
The following example shuts down all Oracle Big Data Appliance servers listed in the server_group file:
dcli -l root -g server_group shutdown -hP now
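If you run dcli from one of the Oracle Big Data Appliance servers, that server must not appear in the group file. The following sketch writes a group file that omits the local server. The function write_server_group and the server names are hypothetical; the names you pass must match the form that hostname -s returns on your servers.

```bash
# Hypothetical helper: write a dcli group file listing every server except
# the one dcli runs from, so dcli never shuts down its own host.
# Usage: write_server_group <group_file> <server_to_exclude> <server>...
# Example:
#   write_server_group server_group "$(hostname -s)" bda1node01 bda1node02 bda1node03
#   dcli -l root -g server_group shutdown -hP now
write_server_group() {
  local outfile="$1" self="$2"
  shift 2
  local server
  : > "$outfile"                        # create or truncate the group file
  for server in "$@"; do
    [ "$server" = "$self" ] && continue # skip the local server
    echo "$server" >> "$outfile"
  done
}
```

Shut down the excluded server last, with the local shutdown command.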
See Also:Chapter 7, "Using the dcli Utility"
The network switches do not have power switches. They power off when power is removed by turning off a PDU or a breaker in the data center.
In an emergency, halt power to Oracle Big Data Appliance immediately. The following emergencies may require powering off Oracle Big Data Appliance:
Natural disasters such as earthquake, flood, hurricane, tornado, or cyclone
Abnormal noise, smell, or smoke coming from the system
Threat to human safety
To perform an emergency power-off procedure for Oracle Big Data Appliance, turn off power at the circuit breaker or pull the emergency power-off switch in the computer room. After the emergency, contact Oracle Support Services to restore power to the system.
Emergency power-off (EPO) switches are required when computer equipment contains batteries capable of supplying more than 750 volt-amperes for more than 5 minutes. Systems that have these batteries include internal EPO hardware for connection to a site EPO switch or relay. Use of the EPO switch removes power from Oracle Big Data Appliance.
The following cautions and warnings apply to Oracle Big Data Appliance:
Do not touch the parts of this product that use high-voltage power. Touching them might result in serious injury.
Do not power off Oracle Big Data Appliance unless there is an emergency. In that case, follow the "Emergency Power-Off Procedure".
Keep the front and rear cabinet doors closed. Failure to do so might cause system failure or result in damage to hardware components.
Keep the top, front, and back of the cabinets clear to allow proper airflow and prevent overheating of components.
Use only the supplied hardware.
Oracle Big Data Appliance ships from the factory with 48 GB of memory: six of the 18 DIMM slots are populated with 8 GB DIMMs. You can populate six or all twelve of the empty slots with 8 GB DIMMs to bring the total memory to either 96 GB (12 x 8 GB) or 144 GB (18 x 8 GB). An upgrade to 144 GB may slightly reduce performance because of lower memory bandwidth: the memory frequency drops from 1333 MHz to 800 MHz.
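The memory totals follow directly from the slot counts, with every DIMM at 8 GB:

```latex
\begin{aligned}
6 \times 8\ \text{GB} &= 48\ \text{GB} && \text{(factory configuration)}\\
(6 + 6) \times 8\ \text{GB} &= 96\ \text{GB} && \text{(six empty slots populated)}\\
(6 + 12) \times 8\ \text{GB} &= 144\ \text{GB} && \text{(all 18 slots populated)}
\end{aligned}
```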
You can add memory to all nodes in the cluster or just to specific nodes that need more memory, such as the NameNode (server 01).
To add memory to a Sun Fire server:
Power down the server.
Replace the plastic fillers with the new DIMMs as described in the Sun Fire X4270 M2 Server Service Manual at
Power on the server.
Repairing the physical disks does not require shutting down Oracle Big Data Appliance. However, individual servers may need to be taken out of the cluster temporarily, which requires downtime for those servers.
See Also:"Parts for Sun Fire Servers" for the repair procedures
The 12 disk drives in each Oracle Big Data Appliance server are controlled by an LSI MegaRAID SAS 9261-8i disk controller. Oracle recommends verifying the status of the RAID devices to avoid possible performance impact or an outage. Validating the RAID devices has minimal impact. The impact of corrective actions varies depending on the specific issue uncovered and may range from simple reconfiguration to an outage.
Issue this command to verify the disk controller configuration:
The following is an example of the output from the command. There should be 12 virtual drives, no degraded or offline drives, and 14 physical devices. The 14 devices are the controllers and the 12 disk drives.
Device Present
================
Virtual Drives    : 12
Degraded          : 0
Offline           : 0
Physical Devices  : 14
Disks             : 12
Critical Disks    : 0
Failed Disks      : 0
If the output is different, then investigate and correct the problem.
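The expected values can also be checked mechanically. The following is a minimal sketch that assumes the "label : value" layout of the summary shown above; the function name check_device_summary is hypothetical. Pipe the summary output into it: it prints each value that differs from the expected one (or is missing) and returns a nonzero status if anything differs.

```bash
# Hypothetical helper: compare the "Device Present" summary on standard input
# with the expected values: 12 virtual drives, 14 physical devices, 12 disks,
# and no degraded, offline, critical, or failed entries.
check_device_summary() {
  awk -F':' '
    BEGIN {
      want["Virtual Drives"] = 12;  want["Degraded"] = 0
      want["Offline"] = 0;          want["Physical Devices"] = 14
      want["Disks"] = 12;           want["Critical Disks"] = 0
      want["Failed Disks"] = 0
    }
    NF == 2 {
      key = $1
      gsub(/^[ \t]+|[ \t]+$/, "", key)    # trim spaces around the label
      if (key in want) {
        seen[key] = 1
        if ($2 + 0 != want[key]) {
          printf "%s: %d (expected %d)\n", key, $2, want[key]
          bad = 1
        }
      }
    }
    END {
      for (k in want)
        if (!(k in seen)) { printf "%s: missing\n", k; bad = 1 }
      exit bad                            # 0 only if everything matched
    }'
}
```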
Use the following command to verify the virtual drive configuration:
# MegaCli64 -LDInfo -lAll -a0
Following is an example of the output for Virtual Drive 0. Ensure that State is Optimal.
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 1.817 TB
Parity Size         : 0
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
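With 12 virtual drives per server, checking every State line by eye is error-prone. The following sketch, with the hypothetical function name check_vd_states, counts the State lines in the MegaCli64 -LDInfo output and reports any drive that is not Optimal. It assumes one "State : value" line per virtual drive, listed in drive order as in the sample above, and 12 drives unless you pass another count.

```bash
# Hypothetical helper: verify that every virtual drive reports State Optimal
# and that the expected number of drives (default 12) is present.
# Example: MegaCli64 -LDInfo -lAll -a0 | check_vd_states
check_vd_states() {
  local expected="${1:-12}"
  awk -v expected="$expected" '
    BEGIN { total = 0 }
    /^State[ \t]*:/ {
      state = $0
      sub(/^State[ \t]*:[ \t]*/, "", state)   # keep only the value
      sub(/[ \t\r]+$/, "", state)             # drop trailing blanks
      if (state != "Optimal") {
        printf "virtual drive %d: %s\n", total, state
        bad = 1
      }
      total++
    }
    END {
      if (total != expected) {
        printf "found %d virtual drives (expected %d)\n", total, expected
        bad = 1
      }
      exit bad
    }'
}
```

The drive number printed is the position of the State line in the output, which matches the Virtual Drive number only if the listing is in order.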
Use the following command to verify the physical drive configuration:
# MegaCli64 -PDList -a0 | grep Firmware
The following is an example of the output from the command. Twelve drives should be Online, Spun Up. If the output is different, then investigate and correct the problem.
Firmware state: Online, Spun Up
Device Firmware Level: 061A
Firmware state: Online, Spun Up
Device Firmware Level: 061A
Firmware state: Online, Spun Up
Device Firmware Level: 061A
. . .
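The count of drives in the Online, Spun Up state can also be checked with a filter. A minimal sketch, assuming the "Firmware state:" lines shown above; the function name check_pd_states is hypothetical. It prints any drive state other than Online, Spun Up, reports when the count differs from the expected 12, and returns a nonzero status in either case.

```bash
# Hypothetical helper: count "Online, Spun Up" drives and flag other states.
# Example: MegaCli64 -PDList -a0 | grep Firmware | check_pd_states
check_pd_states() {
  local expected="${1:-12}"
  awk -v expected="$expected" '
    BEGIN { online = 0 }
    /^Firmware state:/ {
      state = $0
      sub(/^Firmware state:[ \t]*/, "", state)   # keep only the state text
      sub(/[ \t\r]+$/, "", state)
      if (state == "Online, Spun Up") online++
      else { print "unexpected state: " state; bad = 1 }
    }
    END {
      if (online != expected) {
        printf "%d drives online (expected %d)\n", online, expected
        bad = 1
      }
      exit bad
    }'
}
```

The "Device Firmware Level" lines are ignored, so the grep Firmware output can be piped in unchanged.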
You may need to change the InfiniBand network information on an existing Oracle Big Data Appliance. The change may support a media server with multiple InfiniBand cards, or keep InfiniBand traffic on a distinct InfiniBand network such as having production, test, and QA environments in the same rack.
All InfiniBand addresses must be in the same subnet, with a minimum subnet mask of 255.255.240.0 (or /20). Choose a subnet mask wide enough to accommodate possible future expansion of the Oracle Big Data Appliance and InfiniBand network.
You cannot change the host names after running the Mammoth Utility.
To change the InfiniBand IP addresses:
Log in to an Oracle Big Data Appliance server as the root user.
Change to the /etc/sysconfig/network-scripts directory.
Copy the ifcfg-bondib0 file, using a name that does not start with ifcfg:
cp ifcfg-bondib0 orig_ifcfg-bondib0
Edit ifcfg-bondib0 to update the IPADDR, NETMASK, NETWORK, and BROADCAST fields.
The following is an example of an ifcfg-bondib0 file:
IPADDR=188.8.131.52
NETMASK=255.255.255.0
NETWORK=184.108.40.206
BROADCAST=220.127.116.11
DEVICE=bondib0
USERCTL=no
ONBOOT=yes
TYPE=bonding
BONDING_OPTS="mode=active-backup miimon=100 downdelay=5000 updelay=5000 num_grat_arp=100"
HOTPLUG=no
IPV6INIT=no
BOOTPROTO=none
MTU=65520
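When you change IPADDR or NETMASK, the NETWORK and BROADCAST values must change with them. The following pure-bash sketch derives both from an address and mask; the function names are hypothetical. It also prints the prefix length for reference, but it does not decide whether the mask satisfies the subnet requirement described earlier in this procedure.

```bash
# Hypothetical helpers: compute NETWORK and BROADCAST for ifcfg-bondib0.
# Convert a dotted-quad address to a 32-bit integer.
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo "$(( (a << 24) | (b << 16) | (c << 8) | d ))"
}
# Convert a 32-bit integer back to dotted-quad form.
int_to_ip() {
  local n=$1
  echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
}
# Usage: bond_fields <IPADDR> <NETMASK>
bond_fields() {
  local ip mask net bcast m prefix=0
  ip=$(ip_to_int "$1")
  mask=$(ip_to_int "$2")
  net=$(( ip & mask ))                        # network: host bits cleared
  bcast=$(( net | (~mask & 0xFFFFFFFF) ))     # broadcast: host bits set
  m=$mask
  while (( m & 0x80000000 )); do              # count leading one bits
    prefix=$(( prefix + 1 ))
    m=$(( (m << 1) & 0xFFFFFFFF ))
  done
  echo "NETWORK=$(int_to_ip "$net")"
  echo "BROADCAST=$(int_to_ip "$bcast")"
  echo "# prefix length: $prefix"
}
```

For example, bond_fields 192.168.10.5 255.255.240.0 prints NETWORK=192.168.0.0, BROADCAST=192.168.15.255, and a prefix length of 20 (the addresses are illustrative).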
Restart the server:
Verify the new InfiniBand IP address information:
# iblinkinfo | grep reconfigured_host_name
See "Checking the Health of the Network" for information about the iblinkinfo command.
Make a backup copy of the /etc/hosts file:
# cp /etc/hosts /etc/orig_hosts
Edit the file to change the IP addresses.
Repeat this procedure on all Oracle Big Data Appliance servers.
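If you prefer to script the /etc/hosts change, the following sketch rewrites one address in place after the backup copy has been made. It changes a line only when its first field exactly equals the old address, so 192.168.10.1 does not also match 192.168.10.11. The function name replace_host_ip and the addresses are hypothetical; run it once per changed address on each server.

```bash
# Hypothetical helper: replace one IP address in an /etc/hosts-style file.
# Usage: replace_host_ip <old_ip> <new_ip> <file>
# Example: replace_host_ip 192.168.10.1 192.168.40.1 /etc/hosts
replace_host_ip() {
  local old="$1" new="$2" file="$3" tmp rc
  tmp=$(mktemp) || return 1
  awk -v old="$old" -v new="$new" '
    $1 == old { sub(/[^ \t]+/, new) }   # rewrite only the first field
    { print }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file owner and mode
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```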
The InfiniBand network connects the servers through the bondib0 interface to the InfiniBand switches in the rack. This section describes how to perform maintenance on the InfiniBand switches.
Oracle ILOM supports remote administration of the Oracle Big Data Appliance servers. This section explains how to back up and restore the Oracle ILOM configuration settings, which are set by the Mammoth Utility.
See Also:Oracle Integrated Lights Out Manager 3.0 documentation at
To back up the Oracle ILOM configuration settings:
Open your browser on any system on the same network as Oracle Big Data Appliance and enter the Oracle ILOM address of a server. This example uses the Oracle ILOM address of server 08:
Log in as the
ilom-admin user. The default password is
Select the Maintenance tab.
Select the Backup/Restore subtab.
Select the Backup operation and the Browser transfer method, as shown in Figure 14-1.
Enter a pass phrase. The phrase is used to encrypt sensitive information, such as passwords, in the backup.
Click Run to start the backup. The results are downloaded to your local system in an XML file named config_backup.xml.
Save the file in a secure location.
Click the Log Out button.
To restore the settings on a switch:
Open your browser on any system on the same network as Oracle Big Data Appliance and navigate to an Oracle ILOM on a server. This example uses Oracle ILOM on node08:
Log in as the
ilom-admin user. The default password is
Select the Maintenance tab.
Select the Backup/Restore subtab.
Select the Restore operation and the Browser transfer method.
Click Choose File and select the config_backup.xml file saved previously in a backup operation.
Enter the pass phrase that was set during the backup.
Click Run to restore the configuration.
Complete these steps to replace a Sun Network QDR InfiniBand Gateway Switch or a Sun Datacenter InfiniBand Switch 36.
"In-Rack InfiniBand Switch-to-Server Cable Connections" for information on cabling
Sun Network QDR InfiniBand Gateway Switch Installation Guide at
Sun Datacenter InfiniBand Switch 36 User's Guide at
Disconnect the cables from the switch. All InfiniBand cables have labels at both ends indicating their locations. If any cables do not have labels, then label them.
Power off both power supplies on the switch by removing the power plugs.
Remove the switch from the rack.
Install the new switch in the rack.
Restore the switch settings using the backup, as described in "Backing Up and Restoring Oracle ILOM Switch Settings".
Connect to the switch as ilom-admin and open the Fabric Management shell:
-> show /SYS/Fabric_Mgmt
The prompt changes from -> to FabMan@hostname->
Disable the Subnet Manager:
Connect the cables to the new switch, being careful to connect each cable to the correct port.
Verify that there are no errors on any links in the fabric:
FabMan@bda1sw-02-> ibdiagnet -c 1000 -r
Enable the Subnet Manager:
Note:If the replaced switch was the Sun Datacenter InfiniBand Switch 36 spine switch, then manually fail the master Subnet Manager back to the switch by disabling the Subnet Managers on the other switches until the spine switch becomes the master, then reenable the Subnet Manager on all the other switches.
If any component in the InfiniBand network has required maintenance, including replacing an InfiniBand Host Channel Adapter (HCA) on a server, an InfiniBand switch, or an InfiniBand cable, or if operation of the InfiniBand network is suspected to be substandard, then verify the InfiniBand network is operating properly. The following procedure describes how to verify network operation:
Note:Use this procedure any time the InfiniBand network is performing below expectations.
To verify InfiniBand network operation:
Use the ibdiagnet command to verify InfiniBand network quality:
# ibdiagnet -c 1000
Investigate all errors reported by this command. It generates a small amount of network traffic and can run during a normal workload.
Report switch port error counters and port configuration information. The LinkDowned, RcvSwRelayErrors, XmtDiscards, and XmtWait errors are ignored by this command:
# ibqueryerrors.pl -rR -s LinkDowned,RcvSwRelayErrors,XmtDiscards,XmtWait
Check the status of the hardware:
The following is an example of the output:
SUCCESS: Correct system model : SUN FIRE X4270 M2 SERVER
SUCCESS: Correct processor info : Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
SUCCESS: Correct number of types of CPU : 1
SUCCESS: Correct number of CPU cores : 24
SUCCESS: Sufficient GB of memory (>=48): 48
SUCCESS: Correct GB of swap space : 24
SUCCESS: Correct BIOS vendor : American Megatrends Inc.
SUCCESS: Sufficient BIOS version (>=08080102): 08080102
SUCCESS: Recent enough BIOS release date (>=05/23/2011) : 05/23/2011
SUCCESS: Correct ILOM version : 18.104.22.168.a r68533
SUCCESS: Correct number of fans : 6
SUCCESS: Correct fan 0 status : ok
SUCCESS: Correct fan 1 status : ok
SUCCESS: Correct fan 2 status : ok
SUCCESS: Correct fan 3 status : ok
SUCCESS: Correct fan 4 status : ok
SUCCESS: Correct fan 5 status : ok
SUCCESS: Correct number of power supplies : 2
INFO: Detected Santa Clara Factory, skipping power supply checks
SUCCESS: Correct disk controller model : LSI MegaRAID SAS 9261-8i
SUCCESS: Correct disk controller firmware version : 12.12.0-0048
SUCCESS: Correct disk controller PCI address : 13:00.0
SUCCESS: Correct disk controller PCI info : 0104: 1000:0079
SUCCESS: Correct disk controller PCIe slot width : x8
SUCCESS: Correct disk controller battery type : iBBU08
SUCCESS: Correct disk controller battery state : Operational
SUCCESS: Correct number of disks : 12
SUCCESS: Correct disk 0 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 0 firmware (>=61A): 61A
SUCCESS: Correct disk 1 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 1 firmware (>=61A): 61A
. . .
SUCCESS: Correct disk 10 status : Online, Spun Up No alert
SUCCESS: Correct disk 11 status : Online, Spun Up No alert
SUCCESS: Correct Host Channel Adapter model : Mellanox Technologies MT26428 ConnectX VPI PCIe 2.0
SUCCESS: Correct Host Channel Adapter firmware version : 2.9.1000
SUCCESS: Correct Host Channel Adapter PCI address : 0d:00.0
SUCCESS: Correct Host Channel Adapter PCI info : 0c06: 15b3:673c
SUCCESS: Correct Host Channel Adapter PCIe slot width : x8
SUCCESS: Big Data Appliance hardware validation checks succeeded
Check the status of the software:
SUCCESS: Correct OS disk sda partition info : 1 ext3 raid 2 ext3 raid 3 linux-swap 4 ext3 primary
SUCCESS: Correct OS disk sdb partition info : 1 ext3 raid 2 ext3 raid 3 linux-swap 4 ext3 primary
SUCCESS: Correct data disk sdc partition info : 1 ext3 primary
SUCCESS: Correct data disk sdd partition info : 1 ext3 primary
SUCCESS: Correct data disk sde partition info : 1 ext3 primary
SUCCESS: Correct data disk sdf partition info : 1 ext3 primary
SUCCESS: Correct data disk sdg partition info : 1 ext3 primary
SUCCESS: Correct data disk sdh partition info : 1 ext3 primary
SUCCESS: Correct data disk sdi partition info : 1 ext3 primary
SUCCESS: Correct data disk sdj partition info : 1 ext3 primary
SUCCESS: Correct data disk sdk partition info : 1 ext3 primary
SUCCESS: Correct data disk sdl partition info : 1 ext3 primary
SUCCESS: Correct software RAID info : /dev/md2 level=raid1 num-devices=2 /dev/md0 level=raid1 num-devices=2
SUCCESS: Correct mounted partitions : /dev/md0 /boot ext3 /dev/md2 / ext3 /dev/sda4 /u01 ext4 /dev/sdb4 /u02 ext4 /dev/sdc1 /u03 ext4 /dev/sdd1 /u04 ext4 /dev/sde1 /u05 ext4 /dev/sdf1 /u06 ext4 /dev/sdg1 /u07 ext4 /dev/sdh1 /u08 ext4 /dev/sdi1 /u09 ext4 /dev/sdj1 /u10 ext4 /dev/sdk1 /u11 ext4 /dev/sdl1 /u12 ext4
SUCCESS: Correct swap partitions : /dev/sdb3 partition /dev/sda3 partition
SUCCESS: Correct Linux kernel version : Linux 2.6.32-200.21.1.el5uek
SUCCESS: Correct Java Virtual Machine version : HotSpot(TM) 64-Bit Server 1.6.0_29
SUCCESS: Correct puppet version : 2.6.11
SUCCESS: Correct MySQL version : 5.5.17
SUCCESS: All required programs are accessible in $PATH
SUCCESS: All required RPMs are installed and valid
SUCCESS: Big Data Appliance software validation checks succeeded
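The hardware and software checks print one result per line, so anything other than a success is easy to extract. A sketch, with the hypothetical function name failed_checks, that prints every nonblank line not starting with SUCCESS: or INFO: and returns a nonzero status if any exist. It first strips ANSI color codes, on the assumption that the check tools may color their terminal output.

```bash
# Hypothetical helper: show check results that are neither SUCCESS nor INFO.
# Reads the check output on standard input; returns 1 if anything is shown.
failed_checks() {
  local esc out
  esc=$(printf '\033')
  # Remove ANSI color sequences such as ESC[1m or ESC[34m, then drop blank,
  # SUCCESS, and INFO lines. "|| true" keeps an empty result from failing.
  out=$(sed "s/${esc}\[[0-9;]*m//g" | grep -v -E '^[[:space:]]*$|^(SUCCESS|INFO):' || true)
  [ -z "$out" ] && return 0
  printf '%s\n' "$out"
  return 1
}
```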
The Subnet Manager manages all operational characteristics of the InfiniBand network, such as the following:
Discover the network topology
Assign a local identifier to all ports connected to the network
Calculate and program switch forwarding tables
Monitor changes in the fabric
The InfiniBand network can have multiple Subnet Managers, but only one Subnet Manager is active at a time. The active Subnet Manager is the Master Subnet Manager. The other Subnet Managers are the Standby Subnet Managers. If a Master Subnet Manager is shut down or fails, then a Standby Subnet Manager automatically becomes the Master Subnet Manager.
Each Subnet Manager has a configurable priority. When multiple Subnet Managers are on the InfiniBand network, the Subnet Manager with the highest priority becomes the master Subnet Manager. On Oracle Big Data Appliance, the Subnet Managers on the leaf switches are configured as priority 5, and the Subnet Managers on the spine switches are configured as priority 8.
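The priority rule can be illustrated with a short sketch. It is a simplification, not the Subnet Manager implementation: the hypothetical function pick_master_sm reads switch name and priority pairs and picks the highest priority. The InfiniBand tie-break between Subnet Managers of equal priority is not modeled; this sketch simply keeps the first switch listed. The switch names are hypothetical.

```bash
# Hypothetical helper: choose the master Subnet Manager from lines of
# "<switch_name> <priority>", highest priority wins (first one on a tie).
pick_master_sm() {
  awk 'NF >= 2 && (!found || $2 + 0 > best) { best = $2 + 0; master = $1; found = 1 }
       END { if (found) print master }'
}

# Example: one spine switch (priority 8) and two leaf switches (priority 5).
# printf 'bda1sw-ib2 5\nbda1sw-ib1 8\nbda1sw-ib3 5\n' | pick_master_sm   # prints bda1sw-ib1
```

Removing the spine switch from the input mimics its failure: one of the priority 5 leaf switches is then chosen.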
The following guidelines determine where the Subnet Managers run on Oracle Big Data Appliance:
Only run the Subnet Managers on the switches in Oracle Big Data Appliance. Running the Subnet Manager on any other device is not supported.
When the InfiniBand network consists of one, two, or three racks cabled together, all switches must run the Subnet Manager. The master Subnet Manager runs on a spine switch.
When the InfiniBand network consists of four or more racks cabled together, then only the spine switches run the Subnet Manager. The leaf switches must disable the Subnet Manager.
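The placement guidelines reduce to a rule on the number of racks cabled together and the role of the switch. A sketch, with the hypothetical function name sm_should_run, that prints whether the Subnet Manager should run on a given switch:

```bash
# Hypothetical helper: apply the Subnet Manager placement guidelines.
# Usage: sm_should_run <number_of_racks> <spine|leaf>
sm_should_run() {
  local racks="$1" role="$2"
  if (( racks >= 1 && racks <= 3 )); then
    echo enabled                  # one to three racks: every switch runs it
  elif (( racks >= 4 )) && [ "$role" = spine ]; then
    echo enabled                  # four or more racks: spine switches only
  elif (( racks >= 4 )); then
    echo disabled                 # four or more racks: leaf switches disable it
  else
    echo "invalid rack count: $racks" >&2
    return 1
  fi
}
```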
Sun Network QDR InfiniBand Gateway Switch library at
Sun Datacenter InfiniBand Switch 36 library at
The configuration information for Network Time Protocol (NTP) servers can be changed after the initial setup. The following procedure describes how to change the NTP configuration information for InfiniBand switches, Cisco switches, and Sun Fire servers. Oracle recommends changing each server individually.
To update the Oracle Big Data Appliance servers:
Repeat these steps for each server.
To update the InfiniBand switches:
Log in to the switch as the
Follow the instructions in "Setting the Time Zone on an InfiniBand Switch".
To update the Cisco Ethernet switch:
Use telnet to connect to the Cisco Ethernet switch.
Delete the current setting:
# configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
(config)# no ntp server current_IPaddress
Enter the new IP address:
# configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
(config)# ntp server new_IPaddress
Save the current configuration:
# copy running-config startup-config
Exit from the session:
See Also:"Configuring the Cisco Ethernet Switch"
Restart Oracle Big Data Appliance after changing the servers and switches.