Recovering a Corrupted Boot Volume for Linux-Based Instances
If your instance fails to boot successfully or boots with the boot volume set to read-only access, the instance's boot volume may be corrupted. While it is rare, boot volume corruption can occur in the following scenarios:
When an instance experiences a forced shutdown using the API.
When an instance experiences a system hang due to an operating system or software error and a graceful reboot or shutdown of the instance times out, and then a forced shutdown occurs.
When an error or outage occurs in the underlying infrastructure and there were critical disk writes pending in the system.
In most cases a simple reboot will resolve boot volume corruption issues, so this is the first action you should take when troubleshooting this.
This topic describes how to determine if your Linux-based instance's boot volume is corrupted and what steps to take to troubleshoot and recover the corrupted boot volume. For Windows instances, see Recovering a Corrupted Boot Volume for Windows Instances.
Detecting Boot Volume Corruption
Boot volume corruption can prevent an instance from booting successfully, so you may not be able to connect to the instance using SSH. Instead, you can use the instance console connection feature to connect to the malfunctioning instance. For more information about using this feature, see Troubleshooting Instances Using Instance Console Connections.
This section describes how to use a serial console connection to detect if boot volume corruption has occurred.
If you have already confirmed your instance's boot volume is corrupted or if you are using an imported custom image, proceed to the Recovering the Boot Volume section, which describes how to use a second instance along with standard file system tools to both detect and repair boot volume corruption.
- Create a serial console connection for the instance.
At this point, it's normal for the serial console to appear to hang, as the system may have already crashed.
- Reboot the instance from the Console.
Once the reboot process starts, switch back to the terminal window, and you should see system messages from the instance start to appear in the window.
Monitor the messages that appear as the system is starting up. Most operating systems will set the boot volume to read-only as soon as disk corruption is detected to prevent writes from further corrupting the volume, so look for messages that indicate the boot volume is in read-only mode. Following are some examples:
On an instance with iSCSI-attached boot volumes, the
iscsiadmservice will fail to attach a volume because the volume is in read-only mode. This will typically prevent instances from continuing to boot. The serial console may display a message similar to the following:
iscsiadm: Maybe you are not root? iscsiadm: Could not lock discovery DB: /var/lock/iscsi/lock.write: Read-only file system touch: cannot touch `/var/lock/subsys/iscsid': Read-only file system touch: cannot touch `/var/lock/subsys/iscsi': Read-only file system
On an instance with paravirtualized-attached boot volumes, the system may continue the boot process, but will be in a degraded state because nothing can be written to the boot drive.The serial console may display error messages similar to the following:
[FAILED] Failed to start Create Volatile Files and Directories. See 'systemctl status systemd-tmpfiles-setup.service' for details. ... [ 27.160070] cloud-init: os.chmod(path, real_mode) [ 27.166027] cloud-init: OSError: [Errno 30] Read-only file system: '/var/lib/cloud/data'
The error messages and system behavior described here are the most commonly seen for boot volume corruption, however depending on the operating system, you may see different error messages and system behavior. If you don't see the ones described here, consult the documentation for your operating system for additional troubleshooting information.
Recovering the Boot Volume
To troubleshoot and recover the corrupted boot volume, you need to detach the boot volume from the instance and then attach the boot volume to a second instance as a data volume.
Detaching the Boot Volume
If you have detected that your instance's boot volume is corrupted, you need to detach the boot volume from the instance before you can begin troubleshooting and recovery steps.
Attaching the Boot Volume as a Data Volume to a Second Instance
For the second instance, we recommend that you use an instance running an operating system that most closely matches the operating system for the boot volume's instance. You should only attach boot volumes for Linux-based instances to other Linux-based instances. The second instance must be in the same availability domain and region as the boot volume's instance. If no existing instance is available, create a new Linux instance using the steps described in Creating an Instance. Once you have the second instance, make sure you can log into the instance and that it is functional before proceeding with the recovery steps. For steps to access the instance, see Connecting to a Linux Instance. After you have confirmed that the instance is functional, perform the following steps.
lsblkcommand and make note of the drives that are currently on the instance, for example:
lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 0 46.6G 0 disk ├─sda2 8:2 0 8G 0 part [SWAP] ├─sda3 8:3 0 38.4G 0 part / └─sda1 8:1 0 200M 0 part /boot/efi
- Attach the boot volume to the second instance as a data volume. For more
information, see Attaching a Volume.To attach the boot volume as a data volume
- Open the navigation menu and click Compute. Under Compute, click Instances.
- Click the instance that you want to attach a volume to.
- Under Resources, click Attached Block Volumes.
- Click Attach Block Volume.
Select the volume attachment type. If Paravirtualized attachments are available for this instance, we recommend that you select this attachment type for this procedure.
If you select iSCSI as the volume attachment type, you need to connect to the volume, see Connecting to a Volume for more information.
- In the Block Volume Compartment drop-down list, select the compartment.
- Choose the Select Volume option and then select the volume from the Boot Volume section of the Block Volume drop-down list.
- Select Read/Write as the access type.
When the volume's icon no longer lists it as Attaching, proceed with the next steps.
- Run the
lsblkcommand again to confirm that the boot volume now shows up as a volume attached to the instance. In this sample output for the
lsblk, the boot volume attached as a data volume shows up as
lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sdb 8:16 0 46.6G 0 disk ├─sdb2 8:18 0 8G 0 part ├─sdb3 8:19 0 38.4G 0 part └─sdb1 8:17 0 200M 0 part sda 8:0 0 46.6G 0 disk ├─sda2 8:2 0 8G 0 part [SWAP] ├─sda3 8:3 0 38.4G 0 part / └─sda1 8:1 0 200M 0 part /boot/efi
fsckcommand on the volume's root partition. The root partition is usually the largest partition on the volume.
The following sample for the
fsckcommand shows the output when there are no errors or corruption present on the partitions for an Oracle 7.6 instance:
sudo fsck -V /dev/sdb1 fsck from util-linux 2.23.2 [/sbin/fsck.vfat (1) -- /boot/efi] fsck.vfat /dev/sdb1 fsck.fat 3.0.20 (12 Jun 2013) /dev/sdb1: 17 files, 2466/51145 clusters sudo fsck -V /dev/sdb2 fsck from util-linux 2.23.2 sudo fsck -V /dev/sdb3 fsck from util-linux 2.23.2 [/sbin/fsck.xfs (1) -- /] fsck.xfs /dev/sdb3 If you wish to check the consistency of an XFS filesystem or repair a damaged filesystem, see xfs_repair(8).
If errors are present on a partition, you will usually be prompted to repair the errors. Following is an example of an interactive repair session of a corrupt ext4 boot volume for an Ubuntu instance:
sudo fsck -V /dev/sdb1 fsck from util-linux 2.31.1 [/sbin/fsck.ext4 (1) -- /] fsck.ext4 /dev/sdb1 e2fsck 1.44.1 (24-Mar-2018) One or more block group descriptor checksums are invalid. Fix<y> yes Group descriptor 92 checksum is 0xe9a1, should be 0x1f53. FIXED. Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Block bitmap differences: Group 92 block bitmap does not match checksum. FIXED. cloudimg-rootfs: ***** FILE SYSTEM WAS MODIFIED ***** cloudimg-rootfs: 75336/5999616 files (0.1% non-contiguous), 798678/12181243 blocksNote
XFS file systems will usually auto-repair their contents when the system boots up, fixing any corruption during the boot process. You can use the
xfs_repaircommand to force a repair for scenarios where boot volume corruption is preventing the auto-repair functionality from working, as shown in the following example:
sudo xfs_repair /dev/sdb3 Phase 1 - find and verify superblock... Phase 2 - using internal log - zero log... - scan filesystem freespace and inode maps... ... Phase 7 - verify and correct link counts... done