This chapter contains the following sections:
Note:
The failure of a disk is never catastrophic on Oracle Big Data Appliance. No user data is lost. Data stored in HDFS or Oracle NoSQL Database is automatically replicated.
Repair of the physical disks does not require shutting down Oracle Big Data Appliance. However, an individual server might be taken out of the cluster temporarily and require downtime.
See Also:
My Oracle Support Doc ID 1581331.1
"Parts for Oracle Big Data Appliance Servers" for disk repair procedures
The 12 disk drives in each Oracle Big Data Appliance server are controlled by an LSI MegaRAID SAS 9261-8i disk controller. Oracle recommends verifying the status of the RAID devices to avoid possible performance degradation or an outage. Validating the RAID devices has minimal effect on the server. The corrective actions may affect operation of the server and can range from simple reconfiguration to an outage, depending on the specific issue uncovered.
Enter this command to verify the disk controller configuration:
# MegaCli64 -AdpAllInfo -a0 | grep "Device Present" -A 8
The following is an example of the output from the command. There should be 12 virtual drives, no degraded or offline drives, and 14 physical devices. The 14 devices are the controllers and the 12 disk drives.
                Device Present
                ================
Virtual Drives    : 12 
  Degraded        : 0 
  Offline         : 0 
Physical Devices  : 14 
  Disks           : 12 
  Critical Disks  : 0 
  Failed Disks    : 0 
If the output is different, then investigate and correct the problem.
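The comparison against the expected counters can be scripted. The following is a minimal sketch, not part of the appliance software: `check_device_present` is a hypothetical helper name, and the expected values are the ones stated above (12 virtual drives, 14 physical devices, 12 disks, and zero for the other counters). It only reads text on standard input, so it does not touch the controller.

```shell
# Hypothetical helper: reads the "Device Present" excerpt on stdin and
# compares each counter with the healthy values described in this section.
check_device_present() {
    awk -F: '
        BEGIN {
            bad = 0; seen = 0
            want["Virtual Drives"] = 12;   want["Degraded"] = 0
            want["Offline"] = 0;           want["Physical Devices"] = 14
            want["Disks"] = 12;            want["Critical Disks"] = 0
            want["Failed Disks"] = 0
        }
        NF >= 2 {
            key = $1; val = $2
            # Strip the padding around the counter name and its value.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key in want) {
                seen++
                if ((val + 0) != want[key]) {
                    printf "MISMATCH %s: %s (expected %s)\n", key, val, want[key]
                    bad = 1
                }
            }
        }
        END {
            if (seen < 7) { print "INCOMPLETE: only " (seen + 0) " of 7 counters found"; bad = 1 }
            if (!bad) print "OK"
            exit bad
        }'
}
```

A possible invocation is `MegaCli64 -AdpAllInfo -a0 | grep "Device Present" -A 8 | check_device_present`. The function prints `OK` only when all seven counters match, and returns a nonzero status otherwise.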
Enter this command to verify the virtual drive configuration:
# MegaCli64 -LDInfo -lAll -a0
The following is an example of the output for Virtual Drive 0. Ensure that State is Optimal.
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 1.817 TB
Parity Size         : 0
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Use the following command to verify the physical drive configuration:
# MegaCli64 -PDList -a0 | grep Firmware
The following is an example of the output from the command. The 12 drives should be Online, Spun Up. If the output is different, then investigate and correct the problem.
Firmware state: Online, Spun Up
Device Firmware Level: 061A
Firmware state: Online, Spun Up
Device Firmware Level: 061A
Firmware state: Online, Spun Up
Device Firmware Level: 061A
     .
     .
     .
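Counting the firmware states by eye is error-prone with 12 drives. The following sketch tallies them instead; `summarize_firmware_states` is a hypothetical helper name, and it assumes the `Firmware state:` lines look like the ones in the example output above.

```shell
# Hypothetical helper: reads MegaCli64 -PDList output on stdin, prints every
# drive that is not "Online, Spun Up", then prints a one-line tally.
summarize_firmware_states() {
    awk '
        /^Firmware state:/ {
            total++
            if ($0 ~ /^Firmware state: Online, Spun Up[ \t]*$/) online++
            else print "NOT ONLINE: " $0
        }
        END { printf "%d of %d drives Online, Spun Up\n", online, total }'
}
```

For example, `MegaCli64 -PDList -a0 | summarize_firmware_states` should report 12 of 12 drives on a healthy server.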
The following are the basic steps for replacing a server disk drive:
See Also:
"Servicing Storage Drives and Rear Drives" in the Oracle Server X6-2L Service Manual at
http://docs.oracle.com/cd/E62172_01/html/E62184/z400001c165586.html#scrolltoc
"Servicing Storage Drives and Rear Drives" in the Oracle Server X5-2L Service Manual at
http://docs.oracle.com/cd/E41033_01/html/E48325/cnpsm.z40000091011460.html#scrolltoc
"Servicing Storage Drives and Boot Drives" in the Sun Fire X4270M2 Server Service Manual at
http://docs.oracle.com/cd/E19245-01/E21671/hotswap.html#50503714_61628
The Oracle Big Data Appliance servers contain a disk enclosure cage that is controlled by the host bus adapter (HBA). The enclosure holds 12 disk drives that are identified by slot numbers 0 to 11. The drives can be dedicated to specific functions, as shown in Table 13-1.
Oracle Big Data Appliance uses symbolic links, which are defined in /dev/disk/by-hba-slot, to identify the slot number of a disk. The links have the form snpm, where n is the slot number and m is the partition number. For example, /dev/disk/by-hba-slot/s0p1 initially corresponds to /dev/sda1.
When a disk is hot swapped, the operating system cannot reuse the kernel device name. Instead, it allocates a new device name. For example, if you hot swap /dev/sda, then the disk corresponding to /dev/disk/by-hba-slot/s0 might link to /dev/sdn instead of /dev/sda. Therefore, the links in /dev/disk/by-hba-slot/ are automatically updated when devices are added or removed.
The command output lists device names as kernel device names instead of symbolic link names. Thus, /dev/disk/by-hba-slot/s0 might be identified as /dev/sda in the output of a command.
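Because command output reports kernel device names, it helps to translate between the two naming schemes. The following is a small sketch under the naming rule described above; `slot_link` and `kernel_device_for` are hypothetical helper names, and `readlink -f` is assumed to be available, as it is in GNU coreutils.

```shell
# Hypothetical helper: build the by-hba-slot link name for a slot number
# and an optional partition number.
slot_link() {
    if [ -n "${2:-}" ]; then
        printf '/dev/disk/by-hba-slot/s%sp%s\n' "$1" "$2"
    else
        printf '/dev/disk/by-hba-slot/s%s\n' "$1"
    fi
}

# Hypothetical helper: print the kernel device currently behind a link.
# Returns nonzero, printing nothing, if the link does not exist.
kernel_device_for() {
    [ -e "$1" ] || return 1
    readlink -f "$1"
}
```

For example, `kernel_device_for "$(slot_link 0)"` prints the kernel device name that slot 0 currently maps to, which may differ from /dev/sda after a hot swap.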
Table 13-1 shows typical initial mappings between the RAID logical drives and the operating system identifiers. Nonetheless, you must use the mappings that exist for your system, which might be different from the ones listed here. The table also identifies the dedicated function of each drive in an Oracle Big Data Appliance server. The server with the failed drive is part of either a CDH cluster (HDFS) or an Oracle NoSQL Database cluster.
Table 13-1 Disk Drive Identifiers
| Symbolic Link to Physical Slot | Typical Initial Kernel Device Name | Dedicated Function | 
|---|---|---|
| /dev/disk/by-hba-slot/s0 | /dev/sda | Operating system | 
| /dev/disk/by-hba-slot/s1 | /dev/sdb | Operating system | 
| /dev/disk/by-hba-slot/s2 | /dev/sdc | HDFS or Oracle NoSQL Database | 
| /dev/disk/by-hba-slot/s3 | /dev/sdd | HDFS or Oracle NoSQL Database | 
| /dev/disk/by-hba-slot/s4 | /dev/sde | HDFS or Oracle NoSQL Database | 
| /dev/disk/by-hba-slot/s5 | /dev/sdf | HDFS or Oracle NoSQL Database | 
| /dev/disk/by-hba-slot/s6 | /dev/sdg | HDFS or Oracle NoSQL Database | 
| /dev/disk/by-hba-slot/s7 | /dev/sdh | HDFS or Oracle NoSQL Database | 
| /dev/disk/by-hba-slot/s8 | /dev/sdi | HDFS or Oracle NoSQL Database | 
| /dev/disk/by-hba-slot/s9 | /dev/sdj | HDFS or Oracle NoSQL Database | 
| /dev/disk/by-hba-slot/s10 | /dev/sdk | HDFS or Oracle NoSQL Database | 
| /dev/disk/by-hba-slot/s11 | /dev/sdl | HDFS or Oracle NoSQL Database | 
Table 13-2 shows the mappings between HDFS partitions and mount points.
Table 13-2 Mount Points
| Symbolic Link to Physical Slot and Partition | HDFS Partition | Mount Point | 
|---|---|---|
| /dev/disk/by-hba-slot/s0p4 | /dev/sda4 | /u01 | 
| /dev/disk/by-hba-slot/s1p4 | /dev/sdb4 | /u02 | 
| /dev/disk/by-hba-slot/s2p1 | /dev/sdc1 | /u03 | 
| /dev/disk/by-hba-slot/s3p1 | /dev/sdd1 | /u04 | 
| /dev/disk/by-hba-slot/s4p1 | /dev/sde1 | /u05 | 
| /dev/disk/by-hba-slot/s5p1 | /dev/sdf1 | /u06 | 
| /dev/disk/by-hba-slot/s6p1 | /dev/sdg1 | /u07 | 
| /dev/disk/by-hba-slot/s7p1 | /dev/sdh1 | /u08 | 
| /dev/disk/by-hba-slot/s8p1 | /dev/sdi1 | /u09 | 
| /dev/disk/by-hba-slot/s9p1 | /dev/sdj1 | /u10 | 
| /dev/disk/by-hba-slot/s10p1 | /dev/sdk1 | /u11 | 
| /dev/disk/by-hba-slot/s11p1 | /dev/sdl1 | /u12 | 
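The pattern in Table 13-2 is regular enough to compute. The following sketch encodes only the typical layout shown in the table (partition 4 on slots 0 and 1, partition 1 elsewhere, slot n mounted at /u followed by n+1 as two digits); `hdfs_partition_for_slot` is a hypothetical helper name, and your system's actual mappings take precedence.

```shell
# Hypothetical helper: print "<symbolic link> <mount point>" for a slot
# number from 0 to 11, following the typical layout in Table 13-2.
hdfs_partition_for_slot() {
    case $1 in
        0|1)         part=4 ;;   # operating system disks keep HDFS on partition 4
        [2-9]|10|11) part=1 ;;   # data disks have a single partition
        *) echo "slot must be 0-11" >&2; return 1 ;;
    esac
    printf '/dev/disk/by-hba-slot/s%dp%d /u%02d\n' "$1" "$part" "$(($1 + 1))"
}
```

For example, `hdfs_partition_for_slot 10` prints the s10p1 link together with /u11, matching the corresponding table row.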
Use the following MegaCli64 command to verify the mapping of virtual drive numbers to physical slot numbers. See "Replacing a Disk Drive."
# MegaCli64 LdPdInfo a0 | more
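The LdPdInfo output is long, so a condensed view of the mapping can be useful. The following sketch is an assumption about the output layout rather than a documented interface: it expects each virtual drive to be introduced by a `Virtual Drive: n (Target Id: n)` line and each member disk to report a `Slot Number: n` line, as in the examples elsewhere in this chapter. `map_vd_to_slot` is a hypothetical helper name.

```shell
# Hypothetical helper: reads MegaCli64 LdPdInfo output on stdin and prints
# one "VD <virtual drive> -> slot <slot>" line per physical drive.
map_vd_to_slot() {
    awk '
        /^Virtual Drive:/ { vd = $3 }                         # remember the current virtual drive
        /^Slot Number:/   { print "VD " vd " -> slot " $3 }   # report each member slot
    '
}
```

A possible invocation is `MegaCli64 LdPdInfo a0 | map_vd_to_slot`. On a correctly configured server, each virtual drive number would be expected to match its slot number.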
To replace an HDFS disk or an operating system disk that is in a state of predictive failure, you must first dismount the HDFS partitions. You must also turn off swapping before replacing an operating system disk.
Note:
Only dismount HDFS partitions. For an operating system disk, ensure that you do not dismount operating system partitions. Only partition 4 (sda4 or sdb4) of an operating system disk is used for HDFS.
To dismount HDFS partitions:
Log in to the server with the failing drive.
If the failing drive supported the operating system, then turn off swapping:
# bdaswapoff
Removing a disk with active swapping crashes the kernel.
List the mounted HDFS partitions:
# mount -l
/dev/md2 on / type ext4 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/md0 on /boot type ext4 (rw)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda4 on /u01 type ext4 (rw,nodev,noatime) [/u01]
/dev/sdb4 on /u02 type ext4 (rw,nodev,noatime) [/u02]
/dev/sdc1 on /u03 type ext4 (rw,nodev,noatime) [/u03]
/dev/sdd1 on /u04 type ext4 (rw,nodev,noatime) [/u04]
     .
     .
     .
Check the list of mounted partitions for the failing disk. If the disk has no partitions listed, then proceed to "Replacing a Disk Drive." Otherwise, continue to the next step.
Caution:
For operating system disks, look for partition 4 (sda4 or sdb4). Do not dismount an operating system partition.
Dismount the HDFS mount points for the failed disk:
# umount mountpoint
For example, umount /u11 removes the mount point for partition /dev/sdk1.
If the umount commands succeed, then proceed to "Replacing a Disk Drive." If a umount command fails with a device busy message, then the partition is still in use. Continue to the next step.
Open a browser window to Cloudera Manager. For example:
http://bda1node03.example.com:7180
Complete these steps in Cloudera Manager:
Note:
If you remove mount points in Cloudera Manager as described in the following steps, then you must restore these mount points in Cloudera Manager after finishing all other configuration procedures.
Log in as admin.
On the Services page, click hdfs.
Click the Instances subtab.
In the Host column, locate the server with the failing disk. Then click the service in the Name column, such as datanode, to open its page.
Click the Configuration subtab.
Remove the mount point from the Directory field.
Click Save Changes.
From the Actions list, choose Restart this DataNode.
In Cloudera Manager, remove the mount point from NodeManager Local Directories:
On the Services page, click Yarn.
In the Status Summary, click NodeManager.
From the list, click to select the NodeManager that is on the host with the failed disk.
Click the Configuration subtab.
Remove the mount point from the NodeManager Local Directories field.
Click Save Changes.
Restart the NodeManager.
If you have added any other roles that store data on the same HDFS mount point (such as HBase Region Server), then remove and restore the mount points for these roles in the same way.
Return to your session on the server with the failed drive.
Reissue the umount command:
# umount mountpoint
If the umount still fails, run lsof to list open files under the HDFS mount point and the processes that opened them. This may help you to identify the process that is preventing the unmount. For example:
# lsof | grep /u11
Bring the disk offline:
# MegaCli64 PDoffline "physdrv[enclosure:slot]" a0
For example, "physdrv[20:10]" identifies disk s10, which is located in slot 10 of enclosure 20.
Delete the disk from the controller configuration table:
# MegaCli64 CfgLDDel Lslot a0
For example, L10 identifies slot 10.
Complete the steps in "Replacing a Disk Drive."
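The step in the procedure above that checks the mount list for the failing disk can be sketched as a filter. `hdfs_mounts_for_disk` is a hypothetical helper name. It assumes HDFS mount points have the form /unn, as in Table 13-2, and it deliberately ignores everything else so that operating system partitions are never reported as candidates for dismounting.

```shell
# Hypothetical helper: reads `mount -l` output on stdin and prints the /unn
# mount points that belong to the given kernel device (for example, sda).
hdfs_mounts_for_disk() {
    awk -v dev="/dev/$1" '
        # Field 1 is the partition and field 3 is the mount point.
        index($1, dev) == 1 &&
        substr($1, length(dev) + 1) ~ /^[0-9]+$/ &&
        $3 ~ /^\/u[0-9][0-9]$/ { print $3 }'
}
```

For example, `mount -l | hdfs_mounts_for_disk sdk` prints /u11 if that partition is still mounted, and prints nothing if the disk has no mounted HDFS partitions.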
The server may restart during the disk replacement procedures, either because you issued a reboot command or made an error in a MegaCli64 command. In most cases, the server restarts successfully, and you can continue working. However, in other cases, an error occurs so that you cannot reconnect using ssh. In this case, you must complete the reboot using Oracle ILOM.
To restart a server using Oracle ILOM:
Use your browser to open a connection to the server using Oracle ILOM. For example:
http://bda1node12-c.example.com
Note:
Your browser must have a JDK plug-in installed. If you do not see the Java coffee cup on the log-in page, then you must install the plug-in before continuing.
Log in using your Oracle ILOM credentials.
Select the Remote Control tab.
Click the Launch Remote Console button.
Enter Ctrl+d to continue rebooting.
If the reboot fails, then enter the server root password at the prompt and attempt to fix the problem.
After the server restarts successfully, open the Redirection menu and choose Quit to close the console window.
See Also:
Oracle Integrated Lights Out Manager (ILOM) 3.0 documentation at
Complete this procedure to replace a failed or failing disk drive.
Before replacing a failing disk, see "Prerequisites for Replacing a Failing Disk."
Replace the failed disk drive.
Power on the server if you powered it off to replace the failed disk.
Connect to the server as root using either the KVM or an SSH connection from a laptop.
Store the physical drive information in a file:
# MegaCli64 pdlist a0 > pdinfo.tmp
Note: This command redirects the output to a file so that you can perform several searches using a text editor. If you prefer, you can pipe the output through the more or grep commands.
The utility returns the following information for each slot. This example shows a Firmware State of Unconfigured(good), Spun Up.
Enclosure Device ID: 20
Slot Number: 8
Drive's postion: DiskGroup: 8, Span: 0, Arm: 0
Enclosure position: 0
Device Id: 11
WWN: 5000C5003487075C
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Unconfigured(good), Spun Up
Is Commissioned Spare : NO
Device Firmware Level: 061A
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c5003487075d
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST32000SSSUN2.0T061A1126L6M3WX
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
     .
     .
     .
Open the file you created in Step 5 in a text editor and search for the following:
For disks that have a Foreign State of Foreign, clear that status:
# MegaCli64 CfgForeign clear a0
A foreign disk is one that the controller saw previously, such as a reinserted disk.
For disks that have a Firmware State of Unconfigured (Bad), complete these steps:
Note the enclosure device ID number and the slot number.
Enter a command in this format:
# MegaCli64 pdmakegood physdrv[enclosure:slot] a0
For example, [20:10] repairs the disk identified by enclosure 20 in slot 10.
Check the current status of Foreign State again:
# MegaCli64 pdlist a0 | grep -i foreign
If the Foreign State is still Foreign, then repeat the clear command:
# MegaCli64 CfgForeign clear a0
For disks that have a Firmware State of Unconfigured (Good), use the following command. If multiple disks are unconfigured, then configure them in order from the lowest to the highest slot number:
# MegaCli64 CfgLdAdd r0[enclosure:slot] a0
 
Adapter 0: Created VD 1
 
Adapter 0: Configured the Adapter!!
 
Exit Code: 0x00
For example, [20:5] repairs the disk identified by enclosure 20 in slot 5.
If the CfgLdAdd command in Step 9 fails because of cached data, then clear the cache:
# MegaCli64 discardpreservedcache l1 a0
Verify that the disk is recognized by the operating system:
# lsscsi
The disk may appear with its original device name (such as /dev/sdc) or under a new device name (such as /dev/sdn). If the operating system does not recognize the disk, then the disk is missing from the list generated by the lsscsi command.
The lsscsi output might not show the correct order, but you can continue with the configuration. While the same physical to logical disk mapping is required, the same disk to device mapping for the kernel is not required. The disk configuration is based on /dev/disk/by-hba-slot device names.
This example output shows two disks with new device names: /dev/sdn in slot 5, and /dev/sdo in slot 10.
[0:0:20:0]   enclosu ORACLE   CONCORD14        0960  -
[0:2:0:0]    disk    LSI      MR9261-8i        2.12  /dev/sda
[0:2:1:0]    disk    LSI      MR9261-8i        2.12  /dev/sdb
[0:2:2:0]    disk    LSI      MR9261-8i        2.12  /dev/sdc
[0:2:3:0]    disk    LSI      MR9261-8i        2.12  /dev/sdd
[0:2:4:0]    disk    LSI      MR9261-8i        2.12  /dev/sde
[0:2:5:0]    disk    LSI      MR9261-8i        2.12  /dev/sdn
[0:2:6:0]    disk    LSI      MR9261-8i        2.12  /dev/sdg
[0:2:7:0]    disk    LSI      MR9261-8i        2.12  /dev/sdh
[0:2:8:0]    disk    LSI      MR9261-8i        2.12  /dev/sdi
[0:2:9:0]    disk    LSI      MR9261-8i        2.12  /dev/sdj
[0:2:10:0]   disk    LSI      MR9261-8i        2.12  /dev/sdo
[0:2:11:0]   disk    LSI      MR9261-8i        2.12  /dev/sdl
[7:0:0:0]    disk    ORACLE   UNIGEN-UFD       PMAP  /dev/sdm
Check the hardware profile of the server, and correct any errors:
# bdacheckhw
Check the software profile of the server, and correct any errors:
# bdachecksw
If you see a "Wrong mounted partitions" error and the device is missing from the list, then you can ignore the error and continue. However, if you see a "Duplicate mount points" error or the slot numbers are switched, then see "Correcting a Mounted Partitions Error".
Identify the function of the drive, so you configure it properly. See "Identifying the Function of a Disk Drive".
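Instead of searching the saved pdlist output in a text editor, the states that matter for the procedure above can be condensed into one line per drive. The following is a sketch only: `pd_states` is a hypothetical helper name, and it assumes each drive's record lists Enclosure Device ID, Slot Number, Firmware state, and Foreign State in that order, as in the example record above.

```shell
# Hypothetical helper: reads MegaCli64 pdlist output on stdin and prints
# "[enclosure:slot] firmware=<state> foreign=<state>" for each drive.
pd_states() {
    awk -F': *' '
        /^Enclosure Device ID:/ { enc = $2 }
        /^Slot Number:/         { slot = $2 }
        /^Firmware state:/      { fw = $2 }
        # Foreign State comes last in each record, so print the summary here.
        /^Foreign State:/       { printf "[%s:%s] firmware=%s foreign=%s\n", enc, slot, fw, $2 }'
}
```

For example, `pd_states < pdinfo.tmp` summarizes the file created earlier in the procedure. A drive shown with a foreign state of Foreign, or a firmware state of Unconfigured(bad) or Unconfigured(good), needs the corresponding step above.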
When the bdachecksw utility finds a problem, it typically concerns the mounted partitions.
An old mount point might appear in the mount command output, so that the same mount point, such as /u03, appears twice.
To fix duplicate mount points:
Dismount both mount points by using the umount command twice. This example dismounts two instances of /u03:
# umount /u03
# umount /u03
Remount the mount point. This example remounts /u03:
# mount /u03
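Duplicate mount points can also be found without reading the whole mount list. The following is a minimal sketch; `duplicate_mount_points` is a hypothetical helper name, and it assumes the `device on mountpoint type ...` line format shown in the mount output earlier in this chapter.

```shell
# Hypothetical helper: reads `mount` output on stdin and prints every mount
# point that appears more than once.
duplicate_mount_points() {
    awk '$2 == "on" { print $3 }' | sort | uniq -d
}
```

For example, `mount | duplicate_mount_points` prints /u03 if that mount point is listed twice, and prints nothing when there are no duplicates.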
If a disk is in the wrong slot (that is, the virtual drive number), then you can switch two drives.
To switch slots:
Remove the mappings for both drives. This example removes the drives from slots 4 and 10:
# MegaCli64 cfgLdDel L4 a0
# MegaCli64 cfgLdDel L10 a0
Add the drives in the order you want them to appear; the first command obtains the first available slot number:
# MegaCli64 cfgLdAdd [20:4] a0
# MegaCli64 cfgLdAdd [20:5] a0
If mount errors persist even when the slot numbers are correct, then you can restart the server.
The server with the failed disk is configured to support either HDFS or Oracle NoSQL Database, and most disks are dedicated to that purpose. However, two disks are dedicated to the operating system. Before configuring the new disk, find out how the failed disk was configured.
Oracle Big Data Appliance is configured with the operating system on the first two disks.
To confirm that a failed disk supported the operating system:
Check whether the replacement disk corresponds to /dev/sda or /dev/sdb, which are the operating system disks.
# lsscsi
See the output from Step 11 of "Replacing a Disk Drive".
Verify that /dev/sda and /dev/sdb are the operating system mirrored partitioned disks:
# mdadm -Q --detail /dev/md2
/dev/md2:
        Version : 0.90
  Creation Time : Mon Jul 22 22:56:19 2013
     Raid Level : raid1
     .
     .
     .
    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
If the previous steps indicate that the failed disk is an operating system disk, then proceed to "Configuring an Operating System Disk".
The first two disks support the Linux operating system. These disks store a copy of the mirrored operating system, a swap partition, a mirrored boot partition, and an HDFS data partition.
To configure an operating system disk, you must copy the partition table from the surviving disk, create an HDFS partition (ext4 file system), and add the software raid partitions and boot partitions for the operating system.
Complete these procedures after replacing the disk in either slot 0 or slot 1.
The partitioning procedure differs slightly between Oracle Linux versions 5 and 6. Follow the appropriate procedure for your system:
Note:
Replace /dev/disk/by-hba-slot/sn in the following commands with the appropriate symbolic link, either /dev/disk/by-hba-slot/s0 or /dev/disk/by-hba-slot/s1.
After partitioning the disks, you can repair the two logical RAID arrays:
/dev/md0 contains /dev/disk/by-hba-slot/s0p1 and /dev/disk/by-hba-slot/s1p1. It is mounted as /boot.
/dev/md2 contains /dev/disk/by-hba-slot/s0p2 and /dev/disk/by-hba-slot/s1p2. It is mounted as / (root).
Caution:
Do not dismount the /dev/md devices, because that action shuts down the system.
To repair the RAID arrays:
Remove the partitions from the RAID arrays:
# mdadm /dev/md0 -r detached
# mdadm /dev/md2 -r detached
Verify that the RAID arrays are degraded:
# mdadm -Q --detail /dev/md0
# mdadm -Q --detail /dev/md2
Verify that the degraded file for each array is set to 1:
# cat /sys/block/md0/md/degraded
1
# cat /sys/block/md2/md/degraded
1
Restore the partitions to the RAID arrays:
# mdadm --add /dev/md0 /dev/disk/by-hba-slot/snp1
# mdadm --add /dev/md2 /dev/disk/by-hba-slot/snp2
Check that resynchronization is started, so that /dev/md2 is in a state of recovery and not idle:
# cat /sys/block/md2/md/sync_action
repair
To verify that resynchronization is proceeding, you can monitor the mdstat file. A counter identifies the percentage complete.
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      204736 blocks [2/2] [UU]
 
md2 : active raid1 sdb2[2] sda2[0]
      174079936 blocks [2/1] [U_]
      [============>........]  recovery = 61.6% (107273216/174079936) finish=18.4min speed=60200K/sec
The following output shows that synchronization is complete:
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      204736 blocks [2/2] [UU]
 
md2 : active raid1 sdb2[1] sda2[0]
      174079936 blocks [2/2] [UU]
 
unused devices: <none>
Display the content of /etc/mdadm.conf:
# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=df1bd885:c1f0f9c2:25d6...
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=6c949a1a:1d45b778:a6da...
Compare the output of the following command with the content of /etc/mdadm.conf from Step 7:
# mdadm --examine --brief --scan --config=partitions
If the UUIDs in the file are different from UUIDs in the output of the mdadm command:
Open /etc/mdadm.conf in a text editor.
Select from ARRAY to the end of the file, and delete the selected lines.
Copy the output of the command into the file where you deleted the old lines.
Save the modified file and exit.
Complete the steps in "Formatting the HDFS Partition of an Operating System Disk".
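The mdstat check in the procedure above can be reduced to one status line per array. The following sketch assumes the /proc/mdstat layout shown in the examples, where each array's block-count line ends with a member map such as [UU] or [U_]; `md_array_status` is a hypothetical helper name.

```shell
# Hypothetical helper: reads /proc/mdstat content on stdin and prints
# "<array> <in sync|degraded> <member map>" for each array.
md_array_status() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        # An underscore in the member map marks a missing or rebuilding member.
        name != "" && match($0, /\[[U_]+\]/) {
            map = substr($0, RSTART, RLENGTH)
            print name, (map ~ /_/ ? "degraded" : "in sync"), map
            name = ""
        }'
}
```

For example, `md_array_status < /proc/mdstat` reports md2 as degraded while it is resynchronizing and as in sync when the member map returns to [UU].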
Partition 4 (sda4 or sdb4) on an operating system disk is used for HDFS. After you format the partition and set the correct label, HDFS rebalances the job load to use the partition if the disk space is needed.
To format the HDFS partition:
Format the HDFS partition as an ext4 file system:
# mkfs -t ext4 /dev/disk/by-hba-slot/snp4
Note:
If this command fails because the device is mounted, then dismount the drive now and skip step 3. See "Prerequisites for Replacing a Failing Disk" for dismounting instructions.
Verify that the partition label (such as /u01 for s0p4) is missing:
# ls -l /dev/disk/by-label
Dismount the appropriate HDFS partition, either /u01 for /dev/sda, or /u02 for /dev/sdb:
# umount /u0n
Reset the partition label:
# tune2fs -c -1 -i 0 -m 0.2 -L /u0n /dev/disk/by-hba-slot/snp4
Mount the HDFS partition:
# mount /u0n
Complete the steps in "Restoring the Swap Partition".
To restore the swap partition:
Set the swap label:
# mkswap -L SWAP-sdn3 /dev/disk/by-hba-slot/snp3
Setting up swapspace version 1, size = 12582907 kB
LABEL=SWAP-sdn3, no uuid
Verify that the swap partition is restored:
# bdaswapon; bdaswapoff
Filename                          Type            Size    Used    Priority
/dev/sda3                         partition       12287992        0       1
/dev/sdb3                         partition       12287992        0       1
Verify that the replaced disk is recognized by the operating system:
$ ls -l /dev/disk/by-label
total 0
lrwxrwxrwx 1 root root 10 Aug  3 01:22 BDAUSB -> ../../sdn1
lrwxrwxrwx 1 root root 10 Aug  3 01:22 BDAUSBBOOT -> ../../sdm1
lrwxrwxrwx 1 root root 10 Aug  3 01:22 SWAP-sda3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Aug  3 01:22 SWAP-sdb3 -> ../../sdb3
lrwxrwxrwx 1 root root 10 Aug  3 01:22 u01 -> ../../sda4
lrwxrwxrwx 1 root root 10 Aug  3 01:22 u02 -> ../../sdb4
lrwxrwxrwx 1 root root 10 Aug  3 01:22 u03 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Aug  3 01:22 u04 -> ../../sdd1
     .
     .
     .
If the output does not list the replaced disk:
On Linux 5, run udevtrigger.
On Linux 6, run udevadm trigger.
Then repeat step 3. The lsscsi command should also report the correct order of the disks.
Complete the steps in "Restoring the GRUB Master Boot Records and HBA Boot Order".
After restoring the swap partition, you can restore the Grand Unified Bootloader (GRUB) master boot record.
The device.map file maps the BIOS drives to operating system devices. The following is an example of a device map file:
# this device map was generated by anaconda
(hd0)     /dev/sda
(hd1)     /dev/sdb
However, the GRUB device map does not support symbolic links, and the mappings in the device map might not correspond to those used by /dev/disk/by-hba-slot. The following procedure explains how you can correct the device map if necessary.
To restore the GRUB boot record:
Check which kernel device the drive in slot 1 is using:
# ls -ld /dev/disk/by-hba-slot/s1
lrwxrwxrwx 1 root root 9 Apr 22 12:54 /dev/disk/by-hba-slot/s1 -> ../../sdb
If the output displays /dev/sdb as shown in step 1, then proceed to the next step (open GRUB).
If another device is displayed, such as /dev/sdn, then you must first set hd1 to point to the correct device:
Make a copy of the device.map file:
# cd /boot/grub
# cp device.map mydevice.map
# ls -l *device*
-rw-r--r-- 1 root root 85 Apr 22 14:50 device.map
-rw-r--r-- 1 root root 85 Apr 24 09:24 mydevice.map
Edit mydevice.map to point hd1 to the new device. In this example, s1 pointed to /dev/sdn in step 1.
# more /boot/grub/mydevice.map
# this device map was generated by bda install
(hd0)     /dev/sda
(hd1)     /dev/sdn
Use the edited device map (mydevice.map) in the remaining steps.
Open GRUB, using either device.map as shown, or the edited mydevice.map:
# grub --device-map=/boot/grub/device.map
    GNU GRUB  version 0.97  (640K lower / 3072K upper memory)
 
 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename.
]
Set the root device, entering hd0 for /dev/sda, or hd1 for /dev/sdb:
grub> root (hdn,0)
root (hdn,0)
 Filesystem type is ext2fs, partition type 0x83
Install GRUB, entering hd0 for /dev/sda, or hd1 for /dev/sdb:
grub> setup (hdn)
setup (hdn)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hdn)"... failed (this is not fatal)
 Running "embed /grub/e2fs_stage1_5 (hdn,0)"... failed (this is not fatal)
 Running "install /grub/stage1 (hdn) /grub/stage2 p /grub/grub.conf "... succeeded
Done.
Close the GRUB command-line interface:
grub> quit
Ensure that the boot drive in the HBA is set correctly:
# MegaCli64 /c0 show bootdrive
If BootDrive VD:0 is set, the command output is as follows:
Controller = 0
Status = Success
Description = None
Controller Properties :
=====================
----------------
Ctrl_Prop Value
----------------
BootDrive VD:0
----------------
If BootDrive VD:0 is not set, the command output shows No Boot Drive:
Controller = 0
Status = Success
Description = None
Controller Properties :
=====================
----------------
Ctrl_Prop Value
----------------
BootDrive No Boot Drive
----------------
If MegaCli64 /c0 show bootdrive reports that the boot drive is not set, then set it as follows:
# MegaCli64 /c0/v0 set bootdrive=on
Controller = 0
Status = Success
Description = None
Detailed Status :
===============
-----------------------------------------
VD  Property   Value Status   ErrCd ErrMsg
----------------------------------------- 
0   Boot Drive On    Success  0     - 
------------------------------------------
Verify that the boot drive is now set:
# MegaCli64 /c0 show bootdrive
Controller = 0
Status = Success
Description = None
Controller Properties :
=====================
----------------
Ctrl_Prop Value
----------------
BootDrive VD:0 
----------------
Ensure that the auto-select boot drive feature is enabled:
# MegaCli64 adpBIOS EnblAutoSelectBootLd a0
Auto select Boot is already Enabled on Adapter 0.
Check the configuration. See "Verifying the Disk Configuration".
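The device map edit described earlier in this procedure can be done with a filter instead of a text editor. The following is a sketch under the device map format shown above, with one `(hdN)` entry per line; `remap_grub_drive` is a hypothetical helper name. It writes to standard output and leaves the original device.map untouched.

```shell
# Hypothetical helper: reads a GRUB device map on stdin and prints a copy in
# which the given BIOS drive (hd0 or hd1) points to the given kernel device.
remap_grub_drive() {
    awk -v drive="($1)" -v dev="$2" '
        $1 == drive { printf "%s     %s\n", drive, dev; next }   # replace the matching entry
        { print }                                                # copy every other line unchanged
    '
}
```

One possible use is `remap_grub_drive hd1 "$(readlink -f /dev/disk/by-hba-slot/s1)" < /boot/grub/device.map > /boot/grub/mydevice.map`, after which mydevice.map can be passed to grub with the --device-map option.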
Complete the following instructions for any disk that is not used by the operating system. See "Identifying the Function of a Disk Drive".
To configure a disk, you must partition and format it.
Note:
Replace snp1 in the following commands with the appropriate symbolic name, such as s4p1.
To format a disk for use by HDFS or Oracle NoSQL Database:
Complete the steps in "Replacing a Disk Drive", if you have not done so already.
Partition the drive:
# parted /dev/disk/by-hba-slot/sn -s mklabel gpt mkpart primary ext4 0% 100%
Format the partition for an ext4 file system:
# mkfs -t ext4 /dev/disk/by-hba-slot/snp1
Reset the appropriate partition label to the missing device. See Table 13-2.
# tune2fs -c -1 -i 0 -m 0.2 -L /unn /dev/disk/by-hba-slot/snp1
For example, this command resets the label for /dev/disk/by-hba-slot/s2p1 to /u03:
# tune2fs -c -1 -i 0 -m 0.2 -L /u03 /dev/disk/by-hba-slot/s2p1
Setting maximal mount count to -1
Setting interval between checks to 0 seconds
Setting reserved blocks percentage to 0.2% (976073 blocks)
Verify that the replaced disk is recognized by the operating system:
$ ls -l /dev/disk/by-label
total 0
lrwxrwxrwx 1 root root 10 Aug  3 01:22 BDAUSB -> ../../sdn1
lrwxrwxrwx 1 root root 10 Aug  3 01:22 BDAUSBBOOT -> ../../sdm1
lrwxrwxrwx 1 root root 10 Aug  3 01:22 SWAP-sda3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Aug  3 01:22 SWAP-sdb3 -> ../../sdb3
lrwxrwxrwx 1 root root 10 Aug  3 01:22 u01 -> ../../sda4
lrwxrwxrwx 1 root root 10 Aug  3 01:22 u02 -> ../../sdb4
lrwxrwxrwx 1 root root 10 Aug  3 01:22 u03 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Aug  3 01:22 u04 -> ../../sdd1
     .
     .
     .
If the output does not list the replaced disk:
On Linux 5, run udevtrigger.
On Linux 6, run udevadm trigger.
Then repeat step 5. The lsscsi command should also report the correct order of the disks.
Mount the HDFS partition, entering the appropriate mount point:
# mount /unn
For example, mount /u03.
If you are configuring multiple drives, then repeat the previous steps.
If you previously removed a mount point in Cloudera Manager for an HDFS drive, then restore it to the list.
Open a browser window to Cloudera Manager. For example:
http://bda1node03.example.com:7180
Open Cloudera Manager and log in as admin.
On the Services page, click hdfs.
Click the Instances subtab.
In the Host column, locate the server with the replaced disk. Then click the service in the Name column, such as datanode, to open its page.
Click the Configuration subtab.
If the mount point is missing from the Directory field, then add it to the list.
Click Save Changes.
From the Actions list, choose Restart.
If you previously removed a mount point from NodeManager Local Directories, then also restore it to the list using Cloudera Manager.
On the Services page, click Yarn.
In the Status Summary, click NodeManager.
From the list, click to select the NodeManager that is on the host with the failed disk.
Click the Configuration subtab.
If the mount point is missing from the NodeManager Local Directories field, then add it to the list.
Click Save Changes.
From the Actions list, choose Restart.
Check the configuration. See "Verifying the Disk Configuration".
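When several data disks are replaced, substituting the slot number and mount point into each command by hand invites mistakes. The following sketch only prints the partition, format, label, and mount commands from the procedure above for review; it runs nothing. `print_data_disk_commands` is a hypothetical helper name, and the mount point is derived from the typical layout in Table 13-2, which may not match your system.

```shell
# Hypothetical helper: print (do not run) the formatting commands for one
# data disk slot from 2 to 11, using the typical /unn mount point for it.
print_data_disk_commands() {
    case $1 in
        [2-9]|10|11) ;;
        *) echo "expected a data disk slot from 2 to 11" >&2; return 1 ;;
    esac
    link=/dev/disk/by-hba-slot/s$1
    mnt=$(printf '/u%02d' "$(($1 + 1))")
    echo "parted $link -s mklabel gpt mkpart primary ext4 0% 100%"
    echo "mkfs -t ext4 ${link}p1"
    echo "tune2fs -c -1 -i 0 -m 0.2 -L $mnt ${link}p1"
    echo "mount $mnt"
}
```

For example, `print_data_disk_commands 2` prints the four commands for slot 2 with the /u03 label. Slots 0 and 1 are rejected because operating system disks follow a different procedure.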
Before you can reinstall the Oracle Big Data Appliance software on the server, you must verify that the configuration is correct on the new disk drive.
To verify the disk configuration:
Check the software configuration:
# bdachecksw
If there are errors, then redo the configuration steps as necessary to correct the problem.
Check the /root directory for a file named BDA_REBOOT_SUCCEEDED.
If you find a file named BDA_REBOOT_FAILED, then read the file to identify and fix any additional problems.
Use this script to generate a BDA_REBOOT_SUCCEEDED file:
# /opt/oracle/bda/lib/bdastartup.sh
Verify that BDA_REBOOT_SUCCEEDED exists. If you still find a BDA_REBOOT_FAILED file, then redo the previous steps.
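The marker file check can be sketched as a small function. `reboot_marker_status` is a hypothetical helper name; the file names and the bdastartup.sh path are the ones given in the steps above, and the directory argument exists only so that the function can be tried outside /root.

```shell
# Hypothetical helper: report which BDA_REBOOT_* marker file exists in a
# directory (default /root). A FAILED marker takes precedence.
reboot_marker_status() {
    dir=${1:-/root}
    if [ -e "$dir/BDA_REBOOT_FAILED" ]; then
        echo "FAILED: read $dir/BDA_REBOOT_FAILED"
        return 1
    elif [ -e "$dir/BDA_REBOOT_SUCCEEDED" ]; then
        echo "SUCCEEDED"
    else
        echo "NO MARKER: run /opt/oracle/bda/lib/bdastartup.sh"
        return 2
    fi
}
```

Running `reboot_marker_status` as root after bdastartup.sh completes should print SUCCEEDED when the disk configuration is correct.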