9.12 Suspend Error During Storage Live Migration

When you migrate running virtual machines that use local storage, the virtual disks are migrated first, from the source server's local repository to the destination server's local repository. In certain cases, a suspend error may occur for the source Oracle VM Server guest virtual machine if there is additional disk activity during the storage live migration. For example, a process running on the guest virtual machine might be writing to the virtual disk on the source server's local repository while that disk is being migrated to the destination server's local repository.

Because of the additional disk activity, the suspend operation takes longer to complete. This results in a suspend error that causes the storage live migration to fail. A suspend error similar to the following is logged in /var/log/xen/xend.log on the source Oracle VM Server instance:

[timestamp 2792] DEBUG (XendDomainInfo:3509) waitForSuspend: domain 1 state:
[timestamp 2792] INFO (XendCheckpoint:430) [9836] xc: error: Suspend request failed: Internal error
[timestamp 2792] DEBUG (XendDomainInfo:1943) XendDomainInfo.handleShutdownWatch
[timestamp 2792] INFO (XendCheckpoint:430) [9836] xc: error: Domain appears not to have suspended: Internal error
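To check whether a failed migration matches this issue, you can search the xend log on the source Oracle VM Server for the two error messages shown above. The following is a minimal sketch, assuming a POSIX shell with grep on the server; the function name check_suspend_errors is hypothetical and not part of Oracle VM.

```shell
#!/bin/sh
# Hypothetical helper: search an xend log for the suspend-failure messages
# described in this section. Defaults to the log path on the source server.
check_suspend_errors() {
    log="${1:-/var/log/xen/xend.log}"
    if [ ! -r "$log" ]; then
        echo "error: cannot read $log" >&2
        return 2
    fi
    # -n prefixes each match with its line number in the log.
    # grep exits 0 if a match is found and 1 if none is found.
    grep -n -E 'Suspend request failed|Domain appears not' "$log"
}
```

For example, run check_suspend_errors on the source server after a failed storage live migration; any matching lines are printed with their line numbers.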

A timeout error is also displayed in the kernel log on the source Oracle VM Server guest virtual machine:

timestamp server kernel: [ 1601.227167] Freezing of tasks failed after xx seconds (0 tasks refusing to freeze, wq_busy=1):
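Similarly, on the guest virtual machine you can filter the kernel log for the task-freezing failure. This sketch reads kernel log text on standard input; check_freeze_failure is a hypothetical name, and it assumes the message is still present in the guest's kernel log (on a busy system it may already have been overwritten).

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and report whether the
# task-freezing failure described in this section is present.
check_freeze_failure() {
    # grep prints matching lines and exits 0 on a match, non-zero otherwise.
    if grep -F 'Freezing of tasks failed'; then
        return 0
    fi
    echo "no task-freezing failure found" >&2
    return 1
}
```

For example, run dmesg | check_freeze_failure on the guest virtual machine.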

Workaround: On the source Oracle VM Server guest virtual machine, increase the value of the /sys/power/pm_freeze_timeout parameter, which is specified in milliseconds, so that adequate time is available when additional disk activity occurs during storage live migration:

  1. Log in to the guest virtual machine.

  2. Set the pm_freeze_timeout parameter to the maximum downtime that you can afford, up to a maximum of 5 minutes (300000 milliseconds). For example, to set the /sys/power/pm_freeze_timeout parameter to 5 minutes, run the following command on the guest virtual machine:

    # echo "300000" > /sys/power/pm_freeze_timeout

    Note

    The value you set for the pm_freeze_timeout parameter affects how long the source Oracle VM Server guest virtual machine can remain offline during storage live migration.

  3. Perform the storage live migration again. The suspend error should no longer occur.
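Steps 1 and 2 can be wrapped in a small helper that validates the value before writing it. This is a sketch, not an Oracle-provided tool: set_freeze_timeout is a hypothetical name, the 300000 ms ceiling reflects the recommendation in step 2 rather than a kernel limit, and the optional second argument exists only so the function can be exercised against a scratch file. Values written under /sys generally do not persist across a reboot of the guest.

```shell
#!/bin/sh
# Hypothetical helper: set_freeze_timeout MILLISECONDS [FILE]
# FILE defaults to /sys/power/pm_freeze_timeout; writing it requires root.
set_freeze_timeout() {
    ms="$1"
    file="${2:-/sys/power/pm_freeze_timeout}"
    # Accept only a non-empty string of digits.
    case "$ms" in
        ''|*[!0-9]*)
            echo "error: timeout must be a whole number of milliseconds" >&2
            return 1 ;;
    esac
    # Enforce the 5-minute maximum recommended in step 2.
    if [ "$ms" -gt 300000 ]; then
        echo "error: $ms ms exceeds the recommended maximum of 300000 ms" >&2
        return 1
    fi
    echo "current: $(cat "$file") ms"
    echo "$ms" > "$file" || return 1
    echo "new: $(cat "$file") ms"
}
```

For example, run set_freeze_timeout 300000 as root on the guest virtual machine, and then retry the storage live migration.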

If a suspend error still occurs for the source Oracle VM Server guest virtual machine after the pm_freeze_timeout parameter is updated, you may need to avoid storage live migration until the additional disk activity completes, or reconsider the maximum time you can afford for the guest virtual machine to remain offline during storage live migration.

Bug 26289880