Oracle® VM Server for SPARC 3.2 Release Notes


Updated: May 2015
 
 

Migration Issues

Live Migration Might Result In Memory Corruption or Lost Kernel Panic Crash Dumps

Bug ID 20612716: A live migration of a guest domain that runs Oracle Solaris 11.2 SRU 8 from a machine with firmware based on Hypervisor 1.14.x to a machine with Hypervisor 1.13.2 might result in memory corruption or lost kernel panic crash dumps after the guest is rebooted.

    This problem affects the following live migrations:

  • For SPARC T4-based systems, this failure occurs when migrating from a system that runs firmware version 8.7.x to a system that runs up to firmware version 8.6.x.

  • For SPARC T5-based systems and other systems that use the 9.x firmware, this failure occurs when migrating from a system that runs firmware version 9.4.x to a system that runs up to firmware version 9.3.x.


Note - Because of the related bug 20594568, you should use this workaround when performing a live migration from any system that has firmware with Hypervisor 1.14.x to any system that has firmware with Hypervisor 1.13.x:
  • From a system that runs firmware version 8.7.x to a system that runs up to firmware version 8.6.x

  • From a system that runs firmware version 9.4.x to a system that runs up to firmware version 9.3.x


Workaround: To avoid the problem, add the following line to the /etc/system file on the domain being migrated:

set retained_mem_already_checked=1

For information about correctly creating or updating /etc/system property values, see Updating Property Values in the /etc/system File in Oracle VM Server for SPARC 3.2 Administration Guide.

Then, reboot the domain before attempting to migrate from Hypervisor version 1.14.x to Hypervisor version 1.13.2.
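
The following is a minimal sketch of that procedure, assuming direct root access to the guest domain to be migrated (the guest# prompt is hypothetical); if it differs from the procedure in the Administration Guide referenced above, follow the guide:

guest# echo "set retained_mem_already_checked=1" >> /etc/system
guest# reboot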

If the guest domain has already been migrated from firmware 8.7.x to 8.6.x or from 9.4.x to 9.3.x, stop and restart the guest domain. For example:

primary# ldm stop-domain domainname
primary# ldm start-domain domainname

A Live Migration of an Oracle Solaris 11.2 SRU 8 Guest Domain to a Target Machine With Version 1.13.1 Hypervisor Is Blocked

Bug ID 20594568: A live migration of a guest domain that runs Oracle Solaris 11.2 SRU 8 from a machine with firmware based on Hypervisor 1.14.x to a machine with Hypervisor 1.13.1 is blocked.

primary# ldm migrate ldg0 target-host
Target Password:
API group 0x11d v1.0 is not supported in the version of the firmware
running on the target machine.
Domain ldg0 is using features of the system firmware that
are not supported in the version of the firmware running on
the target machine.

    This problem affects the following live migrations:

  • For SPARC T4-based systems, this failure occurs when migrating from a system that runs firmware version 8.7.x to a system that runs up to firmware version 8.5.x.

  • For SPARC T5-based systems and other systems that use the 9.x firmware, this failure occurs when migrating from a system that runs firmware version 9.4.x to a system that runs up to firmware version 9.2.1.c.


Note - Because of the related bug 20612716, you should use this workaround when performing a live migration from any system that has firmware with Hypervisor 1.14.x to any system that has firmware with Hypervisor 1.13.x:
  • From a system that runs firmware version 8.7.x to a system that runs up to firmware version 8.6.x

  • From a system that runs firmware version 9.4.x to a system that runs up to firmware version 9.3.x


Workaround: To avoid the problem, add the following line to the /etc/system file on the domain being migrated:

set retained_mem_already_checked=1

For information about correctly creating or updating /etc/system property values, see Updating Property Values in the /etc/system File in Oracle VM Server for SPARC 3.2 Administration Guide.

Then, reboot the domain and retry the migration.

Domain Migration Might Fail Even Though Sufficient Memory in a Valid Layout Is Available on the Target System

Bug ID 20453206: A migration operation might fail even if sufficient memory in a valid layout is available on the target system. Memory DR operations might make it more difficult to migrate a guest domain.

Workaround: None.

Cannot Perform a Live Migration of a Guest Domain That Uses iSCSI Devices

Bug IDs 19163498 and 16585085: A logical domain that uses iSCSI devices cannot use live migration.

Kernel Zones Block Live Migration of Guest Domains

Bug ID 18289196: On a SPARC system, a running kernel zone within an Oracle VM Server for SPARC domain will block live migration of the guest domain if it runs one or more “down-revision” components. The following error message is displayed:

Live migration failed because Kernel Zones are active.
Stop Kernel Zones and retry.

Workaround: Choose one of the following workarounds:

  • Stop running the kernel zone.

    # zoneadm -z zonename shutdown
  • Suspend the kernel zone.

    # zoneadm -z zonename suspend

Oracle Solaris 10: Domains That Have Only One Virtual CPU Assigned Might Panic During a Live Migration

Bug ID 17285751: On the Oracle Solaris 10 OS, migrating a domain that has only one virtual CPU assigned to it might cause a panic on the guest domain in the function pg_cmt_cpu_fini().

Workaround: Assign at least two virtual CPUs to the guest domain before you perform the live migration. For example, use the ldm add-vcpu number-of-virtual-CPUs domain-name command to increase the number of virtual CPUs assigned to the guest domain.
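
As a hypothetical illustration, the following command run from the control domain adds a second virtual CPU to a guest domain named ldg1 that currently has only one (the domain name is an assumption):

primary# ldm add-vcpu 1 ldg1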

Virtual Network Hang Prevents a Domain Migration

Bug ID 17191488: When attempting to migrate a domain from a SPARC T5-8 to a SPARC T4-4 system, the following error occurs:

primary# ldm migrate ldg1 system2
Target Password:
Timeout waiting for domain ldg1 to suspend
Domain Migration of LDom ldg1 failed

Workaround: To avoid this problem, set extended-mapin-space=on.


Note - This command initiates a delayed reconfiguration if domain-name is primary. In all other cases, stop the domain before you perform this command.
primary# ldm set-domain extended-mapin-space=on domain-name
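
For a guest domain, the sequence might look like the following sketch, which assumes a hypothetical domain named ldg1:

primary# ldm stop-domain ldg1
primary# ldm set-domain extended-mapin-space=on ldg1
primary# ldm start-domain ldg1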

Domain Migrations From SPARC T4 Systems That Run System Firmware 8.3 to SPARC T5, SPARC M5, or SPARC M6 Systems Are Erroneously Permitted

Bug ID 17027275: Domain migrations from SPARC T4 systems that run system firmware 8.3 to SPARC T5, SPARC M5, or SPARC M6 systems should not be permitted. Although the migration succeeds, a subsequent memory DR operation causes a panic.

Workaround: Update the system firmware on the SPARC T4 system to version 8.4. See the workaround for Guest Domain Panics at lgrp_lineage_add(mutex_enter: bad mutex, lp=10351178).

ldm migrate -n Should Fail When Performing a Cross-CPU Migration From a SPARC T5, SPARC M5, or SPARC M6 System to an UltraSPARC T2 or SPARC T3 System

Bug ID 16864417: The ldm migrate -n command does not report failure when attempting to migrate from a SPARC T5, SPARC M5, or SPARC M6 machine to an UltraSPARC T2 or SPARC T3 machine.

Workaround: None.

Migration of a Guest Domain With HIO Virtual Networks and cpu-arch=generic Times Out While Waiting for the Domain to Suspend

Bug ID 15825538: On a logical domain that is configured with both Hybrid network I/O interfaces (mode=hybrid) and cross-CPU migration enabled (cpu-arch=generic), if a secure live migration is executed (ldm migrate), the migration might time out and leave the domain in a suspended state.

Recovery: Restart the logical domain.

Workaround: Do not use hybrid I/O virtual network devices with secure cross-CPU live migration.
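
For example, the following sketch disables hybrid mode on a virtual network device before the migration by clearing its mode property; the device name vnet0 and domain name ldg1 are hypothetical:

primary# ldm set-vnet mode= vnet0 ldg1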

ldm list -o status on Target Control Domain Reports Bogus Migration Progress

Bug ID 15819714: In rare circumstances, the ldm list -o status command reports a bogus completion percentage when used to observe the status of a migration on the target control domain.

This problem has no impact on the domain that is being migrated or on the ldmd daemons on the source or target control domains.

Workaround: Run the ldm list -o status command on the other control domain that is involved in the migration to observe the progress.

Oracle Solaris 10: Primary or Guest Domain Panics When Unbinding or Migrating a Guest Domain That Has Hybrid I/O Network Devices

Bug ID 15803617: The primary domain or an active guest domain might panic during an unbind operation or a live migration operation if the domain is configured with hybrid I/O virtual network devices.

Recovery: Restart the affected domain.

Workaround: Do not use hybrid I/O virtual network devices.

After Canceling a Migration, ldm Commands That Are Run on the Target System Are Unresponsive

Bug ID 15776752: If you cancel a live migration, the memory contents of the domain instance that is created on the target must be “scrubbed” by the hypervisor. This scrubbing process is performed for security reasons and must be complete before the memory can be returned to the pool of free memory. While this scrubbing is in progress, ldm commands become unresponsive. As a result, the Logical Domains Manager appears to be hung.

Recovery: You must wait for this scrubbing request to finish before you attempt to run other ldm commands. This process might take a long time. For example, a guest domain that has 500 Gbytes of memory might take up to 7 minutes to complete this process on a SPARC T4 server or up to 25 minutes on a SPARC T3 server.

Guest Domain Panics When Running the cputrack Command During a Migration to a SPARC T4 System

Bug ID 15776123: If the cputrack command is run on a guest domain while that domain is being migrated to a SPARC T4 system, the guest domain might panic on the target machine after it has been migrated.

Workaround: Do not run the cputrack command during the migration of a guest domain to a SPARC T4 system.

Guest Domain That Uses Cross-CPU Migration Reports Random Uptimes After the Migration Completes

Bug ID 15775055: After a domain is migrated between two machines that have different CPU frequencies, the uptime reported by the ldm list command might be incorrect. These incorrect results occur because uptime is calculated relative to the STICK frequency of the machine on which the domain runs. If the STICK frequency differs between the source and target machines, the uptime appears to be scaled incorrectly.

This issue applies only to UltraSPARC T2, UltraSPARC T2 Plus, and SPARC T3 systems.

The uptime reported and shown by the guest domain itself is correct. Also, any accounting that is performed by the Oracle Solaris OS in the guest domain is correct.

Migrating a Very Large Memory Domain on SPARC T4-4 Systems Results in a Panicked Domain on the Target System

Bug ID 15731303: Avoid migrating domains that have over 500 Gbytes of memory. Use the ldm list -o mem command to see the memory configuration of your domain. Some memory configurations that have multiple memory blocks that total over 500 Gbytes might panic with a stack that resembles the following:

panic[cpu21]/thread=2a100a5dca0:
BAD TRAP: type=30 rp=2a100a5c930 addr=6f696e740a232000 mmu_fsr=10009

sched:data access exception: MMU sfsr=10009: Data or instruction address
out of range context 0x1

pid=0, pc=0x1076e2c, sp=0x2a100a5c1d1, tstate=0x4480001607, context=0x0
g1-g7: 80000001, 0, 80a5dca0, 0, 0, 0, 2a100a5dca0

000002a100a5c650 unix:die+9c (30, 2a100a5c930, 6f696e740a232000, 10009,
2a100a5c710, 10000)
000002a100a5c730 unix:trap+75c (2a100a5c930, 0, 0, 10009, 30027b44000,
2a100a5dca0)
000002a100a5c880 unix:ktl0+64 (7022d6dba40, 0, 1, 2, 2, 18a8800)
000002a100a5c9d0 unix:page_trylock+38 (6f696e740a232020, 1, 6f69639927eda164,
7022d6dba40, 13, 1913800)
000002a100a5ca80 unix:page_trylock_cons+c (6f696e740a232020, 1, 1, 5,
7000e697c00, 6f696e740a232020)
000002a100a5cb30 unix:page_get_mnode_freelist+19c (701ee696d00, 12, 1, 0, 19, 3)
000002a100a5cc80 unix:page_get_cachelist+318 (12, 1849fe0, ffffffffffffffff, 3,
0, 1)
000002a100a5cd70 unix:page_create_va+284 (192aec0, 300ddbc6000, 0, 0,
2a100a5cf00, 300ddbc6000)
000002a100a5ce50 unix:segkmem_page_create+84 (18a8400, 2000, 1, 198e0d0, 1000,
11)
000002a100a5cf60 unix:segkmem_xalloc+b0 (30000002d98, 0, 2000, 300ddbc6000, 0,
107e290)
000002a100a5d020 unix:segkmem_alloc_vn+c0 (30000002d98, 2000, 107e000, 198e0d0,
30000000000, 18a8800)
000002a100a5d0e0 genunix:vmem_xalloc+5c8 (30000004000, 2000, 0, 0, 80000, 0)
000002a100a5d260 genunix:vmem_alloc+1d4 (30000004000, 2000, 1, 2000,
30000004020, 1)
000002a100a5d320 genunix:kmem_slab_create+44 (30000056008, 1, 300ddbc4000,
18a6840, 30000056200, 30000004000)
000002a100a5d3f0 genunix:kmem_slab_alloc+30 (30000056008, 1, ffffffffffffffff,
0, 300000560e0, 30000056148)
000002a100a5d4a0 genunix:kmem_cache_alloc+2dc (30000056008, 1, 0, b9,
fffffffffffffffe, 2006)
000002a100a5d550 genunix:kmem_cpucache_magazine_alloc+64 (3000245a740,
3000245a008, 7, 6028f283750, 3000245a1d8, 193a880)
000002a100a5d600 genunix:kmem_cache_free+180 (3000245a008, 6028f2901c0, 7, 7,
7, 3000245a740)
000002a100a5d6b0 ldc:vio_destroy_mblks+c0 (6028efe8988, 800, 0, 200, 19de0c0, 0)
000002a100a5d760 ldc:vio_destroy_multipools+30 (6028f1542b0, 2a100a5d8c8, 40,
0, 10, 30000282240)
000002a100a5d810 vnet:vgen_unmap_rx_dring+18 (6028f154040, 0, 6028f1a3cc0, a00,
200, 6028f1abc00)
000002a100a5d8d0 vnet:vgen_process_reset+254 (1, 6028f154048, 6028f154068,
6028f154060, 6028f154050, 6028f154058)
000002a100a5d9b0 genunix:taskq_thread+3b8 (6028ed73908, 6028ed738a0, 18a6840,
6028ed738d2, e4f746ec17d8, 6028ed738d4)

Workaround: Avoid performing migrations of domains that have over 500 Gbytes of memory.

nxge Panics When Migrating a Guest Domain That Has Hybrid I/O and Virtual I/O Virtual Network Devices

Bug ID 15710957: When a heavily loaded guest domain has a hybrid I/O configuration and you attempt to migrate it, you might see an nxge panic.

Workaround: Add the following line to the /etc/system file on the primary domain and on any service domain that is part of the hybrid I/O configuration for the domain:

set vsw:vsw_hio_max_cleanup_retries = 0x200

All ldm Commands Hang When Migrations Have Missing Shared NFS Resources

Bug ID 15708982: An initiated or ongoing migration, or any ldm command, hangs forever. This situation occurs when the domain to be migrated uses a shared file system from another system and the file system is no longer shared.

Workaround: Make the shared file system accessible again.
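
For example, if the file system was exported over NFS from another system, re-sharing it on that server restores access; the server prompt and path shown here are hypothetical:

nfs-server# share -F nfs -o rw /export/ldom-images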

Live Migration of a Domain That Depends on an Inactive Master Domain on the Target Machine Causes ldmd to Fault With a Segmentation Fault

Bug ID 15701865: If you attempt a live migration of a domain that depends on an inactive domain on the target machine, the ldmd daemon faults with a segmentation fault, and the domain on the target machine restarts. Although you can still perform a migration, it will not be a live migration.

    Workaround: Perform one of the following actions before you attempt the live migration:

  • Remove the guest dependency from the domain to be migrated.

  • Start the master domain on the target machine.

DRM Fails to Restore the Default Number of Virtual CPUs for a Migrated Domain When the Policy Is Removed or Expired

Bug ID 15701853: After you perform a domain migration while a DRM policy is in effect, if the DRM policy expires or is removed from the migrated domain, DRM fails to restore the original number of virtual CPUs to the domain.

Workaround: If a domain is migrated while a DRM policy is active and the DRM policy is subsequently expired or removed, reset the number of virtual CPUs. Use the ldm set-vcpu command to set the number of virtual CPUs to the original value on the domain.
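
For example, if the migrated domain ldg1 originally had eight virtual CPUs (both the name and the count are hypothetical), reset the count explicitly:

primary# ldm set-vcpu 8 ldg1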

Migration Failure Reason Not Reported When the System MAC Address Clashes With Another MAC Address

Bug ID 15699763: A domain cannot be migrated if it contains a duplicate MAC address. Typically, when a migration fails for this reason, the failure message shows the duplicate MAC address. However, in rare circumstances, this failure message might not report the duplicate MAC address.

# ldm migrate ldg2 system2
Target Password:
Domain Migration of LDom ldg2 failed

Workaround: Ensure that the MAC addresses on the target machine are unique.
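
For example, you might compare the MAC addresses that are in use on both machines before retrying the migration; the prompts and domain name are hypothetical:

source# ldm list -o network ldg2
target# ldm list -o network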

Simultaneous Migration Operations in “Opposite Direction” Might Cause ldm to Hang

Bug ID 15696986: If two ldm migrate commands are issued simultaneously in the “opposite direction,” the two commands might hang and never complete. An opposite direction situation occurs when you simultaneously start a migration on machine A to machine B and a migration on machine B to machine A.

The hang occurs even if the migration processes are initiated as dry runs by using the -n option. When this problem occurs, all other ldm commands might hang.

Workaround: None.

Migration of a Domain That Has an Enabled Default DRM Policy Results in a Target Domain Being Assigned All Available CPUs

Bug ID 15655513: Following the migration of an active domain, CPU utilization in the migrated domain can increase dramatically for a short period of time. If a dynamic resource management (DRM) policy is in effect for the domain at the time of the migration, the Logical Domains Manager might begin to add CPUs. In particular, if the vcpu-max and attack properties were not specified when the policy was added, the default value of unlimited causes all the unbound CPUs in the target machine to be added to the migrated domain.

Recovery: No recovery is necessary. After the CPU utilization drops below the upper limit that is specified by the DRM policy, the Logical Domains Manager automatically removes the CPUs.

Memory DR Is Disabled Following a Canceled Migration

Bug ID 15646293: After an Oracle Solaris 10 9/10 domain has been suspended as part of a migration operation, memory dynamic reconfiguration (DR) is disabled. This action occurs not only when the migration is successful but also when the migration has been canceled, despite the fact that the domain remains on the source machine.

Migrated Domain With MAUs Contains Only One CPU When Target OS Does Not Support DR of Cryptographic Units

Bug ID 15606220: Starting with the Logical Domains 1.3 release, a domain can be migrated even if it has one or more cryptographic units bound to it.

    In the following circumstances, the target machine will contain only one CPU after the migration is completed:

  • Target machine runs Logical Domains 1.2

  • Control domain on the target machine runs a version of the Oracle Solaris OS that does not support cryptographic unit DR

  • You migrate a domain that contains cryptographic units

After the migration completes, the target domain will resume successfully and be operational, but will be in a degraded state (just one CPU).

Workaround: Prior to the migration, remove the cryptographic unit or units from the source machine that runs Logical Domains 1.3.
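
For example, assuming that the domain to be migrated is named ldg1 and has one cryptographic unit bound (both assumptions), remove that unit from the control domain on the source machine:

primary# ldm rm-crypto 1 ldg1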

    Mitigation: To avoid this issue, perform one or both of these steps:

  • Install the latest Oracle VM Server for SPARC software on the target machine.

  • Install patch ID 142245-01 on the control domain of the target machine, or upgrade to at least the Oracle Solaris 10 10/09 OS.

Explicit Console Group and Port Bindings Are Not Migrated

Bug ID 15527921: During a migration, any explicitly assigned console group and port are ignored, and a console with default properties is created for the target domain. This console is created using the target domain name as the console group and using any available port on the first virtual console concentrator (vcc) device in the control domain. If there is a conflict with the default group name, the migration fails.

Recovery: To restore the explicit console properties following a migration, unbind the target domain and manually set the desired properties using the ldm set-vcons command.
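
As a sketch, assuming a migrated domain named ldg1 that originally used console group ldg1-grp and port 5001 (all values are hypothetical), the recovery might look like this:

primary# ldm stop-domain ldg1
primary# ldm unbind-domain ldg1
primary# ldm set-vcons group=ldg1-grp port=5001 ldg1
primary# ldm bind-domain ldg1
primary# ldm start-domain ldg1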

Migration Does Not Fail If a vdsdev on the Target Has a Different Back End

Bug ID 15523133: If the virtual disk on the target machine does not point to the same disk back end that is used on the source machine, the migrated domain cannot access the virtual disk using that disk back end. A hang can result when accessing the virtual disk on the domain.

Currently, the Logical Domains Manager checks only that the virtual disk volume names match on the source and target machines. In this scenario, no error message is displayed if the disk back ends do not match.

Workaround: When you are configuring the target domain to receive a migrated domain, ensure that the disk volume (vdsdev) matches the disk back end used on the source domain.

    Recovery: Do one of the following if you discover that the virtual disk device on the target machine points to the incorrect disk back end:

  • Migrate the domain and fix the vdsdev.

    1. Migrate the domain back to the source machine.

    2. Fix the vdsdev on the target to point to the correct disk back end.

    3. Migrate the domain to the target machine again.

  • Stop and unbind the domain on the target, and fix the vdsdev. If the OS supports virtual I/O dynamic reconfiguration and the incorrect virtual disk is not in use on the domain (that is, it is not the boot disk and is unmounted), do the following (see the sketch after this list):

    1. Use the ldm rm-vdisk command to remove the disk.

    2. Fix the vdsdev.

    3. Use the ldm add-vdisk command to add the virtual disk again.
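
The following is a minimal sketch of this second recovery path, assuming a virtual disk vdisk1 on domain ldg1 that is backed by volume vol1 on the primary-vds0 service, with a corrected back end of /dev/dsk/c0t1d0s2 (all names are hypothetical):

primary# ldm rm-vdisk vdisk1 ldg1
primary# ldm rm-vdsdev vol1@primary-vds0
primary# ldm add-vdsdev /dev/dsk/c0t1d0s2 vol1@primary-vds0
primary# ldm add-vdisk vdisk1 vol1@primary-vds0 ldg1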

Migration Can Fail to Bind Memory Even If the Target Has Enough Available

Bug ID 15523120: In certain situations, a migration fails and ldmd reports that it was not possible to bind the memory needed for the source domain. This situation can occur even if the total amount of available memory on the target machine is greater than the amount of memory being used by the source domain.

This failure occurs because migrating the specific memory ranges in use by the source domain requires that compatible memory ranges are available on the target as well. When no such compatible memory range is found for any memory range in the source, the migration cannot proceed. See Migration Requirements for Memory in Oracle VM Server for SPARC 3.2 Administration Guide.

Recovery: If this condition is encountered, you might be able to migrate the domain if you modify the memory usage on the target machine. To do this, unbind any bound or active logical domain on the target.

Use the ldm list-devices -a mem command to see what memory is available and how it is used. You might also need to reduce the amount of memory that is assigned to another domain.
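
For example, the following sketch checks the memory layout on the target, unbinds a stopped domain, and reduces another domain's memory allocation; the domain names and the size are hypothetical:

primary# ldm list-devices -a mem
primary# ldm stop-domain ldg2
primary# ldm unbind-domain ldg2
primary# ldm set-memory 16G ldg3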

Cannot Connect to Migrated Domain's Console Unless vntsd Is Restarted

Bug ID 15513998: Occasionally, after a domain has been migrated, it is not possible to connect to the console for that domain.

Workaround: Restart the vntsd SMF service to enable connections to the console:

# svcadm restart vntsd

Note - This command will disconnect all active console connections.