12.4 Troubleshooting Virtual Machines

This section describes known issues that you may encounter when creating or using virtual machines, and explains how to resolve them.

12.4.1 Setting the Guest's Clock

PVM guests may perform their own system clock management, for example, using the NTPD (Network Time Protocol daemon), or the hypervisor may perform system clock management for all guests.

You can set paravirtualized guests to manage their own system clocks by setting the xen.independent_wallclock parameter to 1 in the /etc/sysctl.conf file. For example:

xen.independent_wallclock = 1

To have the hypervisor manage paravirtualized guest system clocks, set xen.independent_wallclock to 0. With this setting, any attempt to set or modify the time in a guest fails.

You can temporarily override the setting, until the next reboot, by writing to the corresponding /proc file. For example:

echo 1 > /proc/sys/xen/independent_wallclock

Note

This setting does not apply to hardware virtualized guests.
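
To confirm which mode a paravirtualized guest is currently in, the value can be read back from the same /proc file. The following is a minimal sketch only; the function name wallclock_mode and its optional path argument are illustrative assumptions and are not part of Oracle VM.

```shell
#!/bin/sh
# Sketch: report who manages the clock, based on the value of
# /proc/sys/xen/independent_wallclock. The function name and the
# optional path argument (handy for testing) are hypothetical.
wallclock_mode() {
    file="${1:-/proc/sys/xen/independent_wallclock}"
    if [ ! -r "$file" ]; then
        echo "unknown"            # file missing or unreadable
        return 1
    fi
    case "$(tr -d '[:space:]' < "$file")" in
        1) echo "guest" ;;        # guest manages its own clock
        0) echo "hypervisor" ;;   # hypervisor manages the guest clock
        *) echo "unknown"; return 1 ;;
    esac
}
```

Calling wallclock_mode with no argument inside a paravirtualized guest prints guest or hypervisor; on a system without the /proc entry it prints unknown.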

12.4.2 Wallclock Time Skew Problems

Additional parameters may be needed in the boot loader configuration file (grub.conf) for certain operating system variants after the guest is installed. Specifically, for optimal clock accuracy, specify Linux guest boot parameters that ensure the pit clock source is used. For most guests, adding clock=pit nohpet nopmtimer results in the selection of pit as the clock source. Published templates for Oracle VM include these additional parameters.

Maintaining accurate virtual time can be difficult. These parameters tune virtual time management; they supplement, but do not replace, an NTP time service running within the guest. Ensure that the ntpd service is running and that the /etc/ntp.conf configuration file points to valid time servers.
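
One way to check whether a guest was booted with the recommended parameters is to inspect its kernel command line. The sketch below is an illustration under stated assumptions: the function name missing_clock_params is hypothetical, and it reads /proc/cmdline when no command line is passed in.

```shell
#!/bin/sh
# Sketch: list which of the recommended clock parameters are absent
# from a kernel command line. The function name is hypothetical; by
# default it inspects /proc/cmdline of the running guest.
missing_clock_params() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    for param in clock=pit nohpet nopmtimer; do
        case " $cmdline " in
            *" $param "*) ;;                # parameter present
            *) printf '%s\n' "$param" ;;    # parameter missing
        esac
    done
}
```

If the function prints nothing, all three parameters are present. Any parameter it prints would need to be added to the kernel line in grub.conf.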

12.4.3 Mouse Pointer Tracking Problems

If your mouse pointer fails to track your cursor in a VNC Viewer session in a hardware virtualized guest, add the following to the Oracle VM Server configuration file located at /etc/xen/xend-config.sxp to force the device model to use absolute (tablet) coordinates:

usbdevice='tablet'

Restart the Oracle VM Server for the changes to take effect. You may need to do this for each Oracle VM Server in the server pool.

12.4.4 Cloning Virtual Machine from Oracle VM 2.x Template Stuck in Pending

When creating a virtual machine from an Oracle VM 2.x template, the clone job fails with the error:

OVMAPI_9039E Cannot place clone VM: template_name.tgz, in Server Pool: server-pool-uuid.
That server pool has no servers that can run the VM.

This is caused by a network configuration inconsistency with the vif = ['bridge=xenbr0'] entry in the virtual machine's configuration file.

To resolve this issue, remove any existing networks in the virtual machine template, and replace them with valid networks which have the Virtual Machine role. Start the clone job again and the virtual machine clone is created. Alternatively, remove any existing networks in the template, restart the clone job, and add in any networks after the clone job is complete.

12.4.5 Hardware Virtualized Guest Stops

When running hardware virtualized guests, the memory usage of the QEMU process (qemu-dm) may grow substantially, especially under heavy I/O loads. This may cause the hardware virtualized guest to stop as it runs out of memory. If this occurs, increase the memory allocation for dom0, for example from 512 MB to 768 MB. See Section 1.6, “Changing the Memory Size of the Management Domain” for information on changing the dom0 memory allocation.

12.4.6 Migrating Virtual Machines

You cannot migrate virtual machines between computers whose hardware is not identical. To migrate virtual machines, the computers must be the same make and model, and the CPUs must be in the same CPU family.

Virtual machines can be live migrated between instances of Oracle VM Server that are at the same release or later. For virtual machines running on an x86 platform, a rule exception is generated if you attempt to live migrate a virtual machine to an Oracle VM Server with an earlier release than the Oracle VM Server where the virtual machine is running.
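
The release rule above amounts to a version comparison: the target release must be the same as, or later than, the source release. The following sketch illustrates that comparison only. The function name can_live_migrate is hypothetical, it relies on the version sort (sort -V) provided by GNU coreutils, and it does not check the hardware requirements described earlier in this section.

```shell
#!/bin/sh
# Sketch: succeed only when the target release is the same as, or
# later than, the source release. Relies on "sort -V" (version sort)
# from GNU coreutils. The function name is hypothetical.
can_live_migrate() {
    src="$1"
    dst="$2"
    # The lower of the two releases sorts first; migration is allowed
    # when that lower (or equal) release is the source.
    lowest=$(printf '%s\n%s\n' "$src" "$dst" | sort -V | head -n 1)
    [ "$lowest" = "$src" ]
}
```

For example, can_live_migrate 3.4.2 3.4.6 succeeds, while can_live_migrate 3.4.6 3.4.2 fails. The release numbers here are placeholders for illustration.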

12.4.7 Recovering From A Failed Local Virtual Machine Migration

If a virtual machine hosted on a local repository is live migrated and the source or target Oracle VM Server becomes unavailable during the migration, Oracle VM Manager attempts to roll back the operation. This rollback process brings the original version of the virtual machine back online on the source Oracle VM Server and then performs a cleanup operation on the target Oracle VM Server when it becomes available again. This cleanup process involves killing the paused virtual machine that may have been copied to the target Oracle VM Server and then cleaning the target repository of virtual disks, virtual machine configurations and temporary files. Finally, a repository refresh is performed on the repository on the source server to ensure that everything is in order.

Before the cleanup operation is triggered, an event is created within Oracle VM Manager to indicate that the migration job has failed or been aborted and to track the rollback process. When the event is generated within Oracle VM Manager, it is set with a 'WARNING' status. The rollback process is generated as a set of up to three different jobs that are each given a timeout period of 15 minutes, and which are triggered to attempt to run every 10 seconds. If these jobs succeed, Oracle VM Manager acknowledges the event. If the jobs all time out, Oracle VM Manager still acknowledges the event, but a second user-acknowledgeable event is created with 'WARNING' status to indicate that the rollback failed. Depending on the cause of the rollback failure, Oracle VM Manager might also create user-acknowledgeable events with 'CRITICAL' status.

Because jobs are usually performed sequentially, it may take a total of 45 minutes before the entire rollback process times out and the new event indicating rollback failure is generated. The rollback failure event is also logged in the log file /u01/app/oracle/ovm-manager-3/domains/ovm_domain/servers/AdminServer/logs/AdminServer.log on the Oracle VM Manager host.
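
The 45-minute figure follows from the job count and the per-job timeout given above. As a small worked example, with variable names chosen here for illustration:

```shell
#!/bin/sh
# Worst-case rollback duration, using the figures quoted above.
job_count=3       # up to three rollback jobs, run sequentially
timeout_min=15    # each job times out after 15 minutes
total_min=$((job_count * timeout_min))
echo "Worst-case rollback time: ${total_min} minutes"
```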

The information in the rollback failure event contains the rollback plan that Oracle VM Manager attempted to follow to clean up a failed virtual machine migration. You can view this event by using the getEventsForObject command in the Oracle VM Manager Command Line Interface, by viewing the events associated with the virtual machine in the Oracle VM Manager Web Interface, or through the Oracle VM Web Services API.

The following content represents the typical output displayed within the description field for a rollback failure event:

Live VM Migration With Storage, started at 2015-11-04 09:51:13,205

    VM: [VirtualMachineDbImpl] 0004fb0000060000c71d489702c240b3<978> (MyVM)

    Source server: [ServerDbImpl] 30:30:38:37:30:32:58:4d:51:34:35:30:30:37:4c:52<386> (ovs216)
    Target server: [ServerDbImpl] 30:30:38:37:30:32:58:4d:51:34:35:30:30:38:58:42<238> (ovs215)

    Source repository: [RepositoryDbImpl] 0004fb0000030000c4fca9a963e2706c<479> (r216)
    Target repository: [RepositoryDbImpl] 0004fb0000030000f9ba13d5063e330a<382> (r215)

    Source vDisks to be migrated:
        /OVS/Repositories/0004fb0000030000c4fca9a963e2706c/VirtualDisks/0004fb000012000005213553a5bba24f.img

Migration job has failed or was aborted.
VM's server has been set back to: (ovs216)
Source vDisk files have been retained.

Constructed the following post-migration completion plan at 2015-11-04 09:51:36,686

    VM to be killed on server: (ovs215)

    To be deleted on target server (ovs215):
            vDisk: /OVS/Repositories/0004fb0000030000f9ba13d5063e330a/VirtualDisks/0004fb000012000005213553a5bba24f.img
         tmp file: /OVS/Repositories/0004fb0000030000f9ba13d5063e330a/VirtualDisks/tmp_dest.0004fb000012000005213553a5bba24f.img

    Also to be deleted on target server (ovs215):
         cfg file: /OVS/Repositories/0004fb0000030000f9ba13d5063e330a/VirtualMachines/0004fb0000060000c71d489702c240b3/vm.cfg
        directory: /OVS/Repositories/0004fb0000030000f9ba13d5063e330a/VirtualMachines/0004fb0000060000c71d489702c240b3
         tmp file: /OVS/Repositories/0004fb0000030000f9ba13d5063e330a/VirtualMachines/tmp_dest.0004fb0000060000c71d489702c240b3

    Source repository (r216) must be refreshed.
    

Note that the description of the event provides detailed information about the migration process and indicates that the migration job has failed. The message explains that the virtual machine is set back to run on the source server and that the source virtual disks have been retained. This means that the virtual machine may either be running or stopped on the source server, but from the perspective of Oracle VM Manager, the location of the virtual machine has been reverted. Most significantly, the output contains a 'post-migration completion plan'. This plan provides a full breakdown of the steps that must be performed to roll the environment back to its original state.

If an event like this appears for a failed migration of a locally hosted virtual machine, you must manually perform the rollback steps on the target server when it next becomes available. It is very important that you ensure that the rollback steps are performed on the systems indicated in the post-migration completion plan. Performing any of these steps on another server could have detrimental effects and could result in virtual machine corruption.

Kill the Virtual Machine on the Indicated Oracle VM Server

The first step in this plan involves killing the virtual machine on the indicated Oracle VM Server or servers. Depending on the state of the migration at the time that the target Oracle VM Server became unavailable, this may require an action on either the target Oracle VM Server or both the target and source Oracle VM Servers. In some cases you may not need to perform this action on either Oracle VM Server. The appropriate action is logged in the event description.

During the migration, the virtual machine enters into a paused state as it is copied from the source Oracle VM Server to the target Oracle VM Server. Once the copy is complete, the virtual machine on the target Oracle VM Server is not indicated within Oracle VM Manager in any way, as this would conflict with the virtual machine with the identical UUID that is located on the original source Oracle VM Server. This transition is performed within Oracle VM Manager when the migration is complete. As a result, two virtual machines with identical UUIDs may exist within the environment for the period of the migration. If the target server goes offline at any point during the migration, it is frequently the case that at least one of these virtual machines must be killed off to prevent conflict. Since the representation of the virtual machine within Oracle VM Manager is not reliable until the rollback has been completed, you must perform the kill operation directly on the indicated Oracle VM Server. This is usually done over SSH as the root user, using the following command:

ovs-agent-rpc stop_vm "''" "'0004fb0000060000c71d489702c240b3'" "True"

Note that 0004fb0000060000c71d489702c240b3 should match the UUID of the virtual machine that you were originally migrating. Also pay attention to the quotes in each of the arguments presented here. The first argument for this command is empty, so a pair of single quotes is enclosed in a pair of double quotes. The second argument is the UUID of the virtual machine that you intend to kill, enclosed in a pair of single quotes within a pair of double quotes. Finally, the last argument is used to force the action and contains the text True enclosed in a pair of double quotes.

Note that you should use this command to stop the virtual machine because it helps to identify the correct virtual machine domain to destroy, it maintains the integrity of your environment and logs any actions carried out on the virtual machine. Do not attempt to use Xen hypervisor tools to perform any actions on the virtual machine directly without explicit instruction from an Oracle Support representative.
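
Because the nested quoting is easy to get wrong when typing the command by hand, it may help to generate the command line from the UUID and review it before running it. The following is a sketch only: the helper name stop_vm_cmd is hypothetical, and it prints the command rather than executing it.

```shell
#!/bin/sh
# Sketch: print (not run) the stop command for a given virtual machine
# UUID, with the nested quoting described above. The helper name is
# hypothetical; only the printed command comes from this section.
stop_vm_cmd() {
    uuid="$1"
    # Expect exactly 32 lowercase hexadecimal characters, as in the
    # example UUID used in this section.
    if ! printf '%s' "$uuid" | grep -Eq '^[0-9a-f]{32}$'; then
        echo "invalid UUID: $uuid" >&2
        return 1
    fi
    printf 'ovs-agent-rpc stop_vm "%s" "%s" "True"\n' "''" "'$uuid'"
}
```

Running stop_vm_cmd with the UUID from the event description prints a line in the same form as the command shown above, which you can then copy to the indicated Oracle VM Server.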

Remove any Virtual Disks from the Repository on the Target Oracle VM Server

A live migration of a virtual machine that is hosted on local storage also requires that any virtual disks are copied from the repository hosted on the source server across to the repository of the target server. Therefore, it is necessary that you manually delete any of these files from the repository hosted on the target Oracle VM Server to clean the environment. To do this, you must SSH to the target Oracle VM Server and delete the files listed in the plan returned in the event description. For example:

rm -f /OVS/Repositories/0004fb0000030000f9ba13d5063e330a/VirtualDisks/0004fb000012000005213553a5bba24f.img
rm -f /OVS/Repositories/0004fb0000030000f9ba13d5063e330a/VirtualDisks/tmp_dest.0004fb000012000005213553a5bba24f.img

Remove the Virtual Machine Configuration from the Repository on the Target Oracle VM Server

The virtual machine configuration for the virtual machine is also copied from the repository hosted on the source server across to the repository of the target server during the migration. Therefore, it is necessary that you manually delete any of these files and directories from the repository hosted on the target Oracle VM Server to clean the environment. To do this, you must SSH to the target Oracle VM Server and delete the files listed in the plan returned in the event description. For example:

rm -f /OVS/Repositories/0004fb0000030000f9ba13d5063e330a/VirtualMachines/0004fb0000060000c71d489702c240b3/vm.cfg
rm -rf /OVS/Repositories/0004fb0000030000f9ba13d5063e330a/VirtualMachines/0004fb0000060000c71d489702c240b3
rm -f /OVS/Repositories/0004fb0000030000f9ba13d5063e330a/VirtualMachines/tmp_dest.0004fb0000060000c71d489702c240b3
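
To avoid mistyping the long paths, the files and directories to delete can be extracted from the event description itself. The sketch below assumes that the description has been saved to a text file and that the plan uses the labels shown in the sample output (vDisk, tmp file, cfg file, directory). The function name plan_delete_paths is hypothetical, and the function only prints paths; it deletes nothing.

```shell
#!/bin/sh
# Sketch: print the paths that the post-migration completion plan
# lists for deletion. Reads the event description on standard input
# and only prints; it deletes nothing. The function name is
# hypothetical.
plan_delete_paths() {
    sed -nE 's/^[[:space:]]*(vDisk|tmp file|cfg file|directory): (.*)$/\2/p'
}
```

For example, plan_delete_paths < event.txt lists one path per line, where event.txt is a placeholder file name. Review the list against the plan, and run the deletions only on the server that the plan indicates.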

Refresh the Repository on the Source Oracle VM Server

During the migration process, Oracle VM Manager updates its model of the source and target repositories hosted on each Oracle VM Server to match the environment as it would be after the migration is complete. It does not revert this representation unless an automated rollback is achieved. If the rollback has failed and you have performed manual steps to revert your environment to its original state, you must also refresh the repository within Oracle VM Manager so that the model accurately reflects the state of the repository. You can either do this using the Oracle VM Manager Web Interface or you can use the Oracle VM Manager Command Line Interface directly. For example:

ssh -l admin localhost -p 10000 refresh repository name="r216"

At this point, your environment should be completely reverted.

12.4.8 Migrating Large Hardware Virtualized Guest Results in CPU Soft Lock

On some hardware, such as the SUN FIRE X4170 M2 Server, migration of very large virtual machines using hardware virtualization can result in a soft lockup causing the virtual machine to become unresponsive. This lock is caused when the migration causes the virtual machine kernel to lose the clock source. Access to the console for the virtual machine shows a series of error messages similar to the following:

BUG: soft lockup - CPU#0 stuck for 315s! [kstop/0:2131]

To resolve this, restart the virtual machine, add the clocksource=jiffies option to the HVM guest kernel command line, and then reboot the virtual machine.

Important

Use this option only on HVM guest systems that have actually experienced a CPU soft lock.

12.4.9 Hardware Virtualized Guest Devices Not Working as Expected

Some devices, such as sound cards, may not work as expected in hardware virtualized guests. In a hardware virtualized guest, a device that requires physical memory addresses instead uses virtualized memory addresses, so incorrect memory location values may be set. This is because DMA (Direct Memory Access) is virtualized in hardware virtualized guests.

Hardware virtualized guest operating systems expect to be loaded in memory starting somewhere around address 0 and upwards. This is only possible for the first hardware virtualized guest loaded. Oracle VM Server virtualizes the memory address to be 0 to the size of allocated memory, but the guest operating system is actually loaded at another memory location. The difference is fixed up in the shadow page table, but the operating system is unaware of this.

For example, a sound that is loaded into memory at an address of 100 MB in a hardware virtualized guest running Microsoft Windows™ may produce garbage through the sound card instead of the intended audio. This is because the sound is actually loaded at 100 MB plus 256 MB. The sound card receives the address of 100 MB, but the sound is actually located 256 MB higher, at 356 MB.

An IOMMU (Input/Output Memory Management Unit) in the computer's memory management unit would remove this problem as it would take care of mapping virtual addresses to physical addresses, and enable hardware virtualized guests direct access to the hardware.

12.4.10 Paravirtualized Guest Disk Devices are Not Recognized

If you opt to create a PVHVM or PVM, you must ensure that all disks that the virtual machine is configured to use are configured as paravirtual devices, or they may not be recognized by the virtual machine. If you discover that a disk or virtual cdrom device is not being recognized by your virtual machine, you may need to edit the vm.cfg file for the virtual machine directly. To do this, determine the UUID of the virtual machine, and then locate the configuration file in the repository, for example on an Oracle VM Server:

# vi /OVS/Repositories/UUID/vm.cfg

Locate each disk entry that contains a hardware device such as hda, hdb, or hdc and replace it with an xvd mapping, such as xvda, xvdb, or xvdc.

Restart the virtual machine with the new configuration, to check that it is able to discover the disk or virtual cdrom device.
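
The device name substitution can also be scripted. The following is a minimal sketch, assuming disk entries of the form 'file:/path/disk.img,hda,w' in vm.cfg; the function name hd_to_xvd is hypothetical. It reads from standard input and writes to standard output, so the original file is left untouched.

```shell
#!/bin/sh
# Sketch: rewrite emulated device names (hda, hdb, ...) in vm.cfg disk
# entries to paravirtual names (xvda, xvdb, ...). Assumes entries of
# the form 'file:/path/disk.img,hda,w'. The function name is
# hypothetical.
hd_to_xvd() {
    # Match ",hdX" only when followed by "," or ":" so that path
    # components containing "hd" are not changed.
    sed -E 's/,hd([a-z])([,:])/,xvd\1\2/g'
}
```

For example, hd_to_xvd < vm.cfg > vm.cfg.new writes a modified copy that you can compare with the original, using diff, before replacing it.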

12.4.11 Cannot Create a Virtual Machine from Installation Media

When creating a virtual machine, the following message may be displayed:

Error: There is no server supporting hardware virtualization in the selected server pool.

To resolve this issue, make sure the Oracle VM Server supports hardware virtualization. Follow these steps to check:

  1. Run the following command to check if hardware virtualization is supported by the CPU:

    # grep -E 'vmx|svm' /proc/cpuinfo

    If any output that contains vmx (Intel) or svm (AMD) is displayed, the CPU supports hardware virtualization. Here is an example of the returned message:

    flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr 
    sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm

    Note

    The /proc/cpuinfo file only shows virtualization capabilities starting with Linux 2.6.15 (Intel®) and Linux 2.6.16 (AMD). Use the uname -r command to query your kernel version.

  2. Make sure you have enabled hardware virtualization in the BIOS.

  3. Run the following command to check if the operating system supports hardware virtualization:

    # xm info |grep hvm

    The following is an example of the returned message:

    xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x 

If the CPU does not support hardware virtualization, use the paravirtualized method to create the virtual machine. See the Servers and VMs Tab section in the Oracle VM Manager Online Help for information on creating a paravirtualized virtual machine.
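
The CPU check in step 1 can be wrapped in a small function for use in scripts. This is a sketch only: the function name cpu_has_hw_virt is hypothetical, and it tests for the vmx flag reported by Intel processors and the svm flag reported by AMD processors. It does not cover steps 2 and 3, so a successful result does not confirm that virtualization is enabled in the BIOS.

```shell
#!/bin/sh
# Sketch: succeed if the CPU flags include vmx (Intel) or svm (AMD).
# The function name is hypothetical; the optional argument lets you
# point it at a saved copy of /proc/cpuinfo.
cpu_has_hw_virt() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # -w matches whole words only, so similar flag names do not match.
    grep -E '^flags' "$cpuinfo" | grep -Eqw 'vmx|svm'
}
```

For example: cpu_has_hw_virt && echo "CPU supports hardware virtualization".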

12.4.12 Cannot Change CD in the Virtual Machine

To change the CD in a virtual machine:

  1. Unmount the first CD:

    # umount mount-point

  2. Select the second ISO file, and click Change CD.

  3. Mount the second CD:

    # mount /dev/cdrom mount-point

12.4.13 Generating Guest Dump Files on Oracle VM Server (x86)

The Xen hypervisor makes it possible to generate a core dump file for a virtual machine in the case that it crashes. This file can be useful for debugging and support purposes. Core dump files can be large and to avoid overwriting files, each file is named uniquely. When this facility is enabled, core dump files are saved to /var/xen/dump on the Oracle VM Server where the virtual machine was running when it crashed. This can rapidly use up available disk space on the dom0 system partition. If you enable this facility, you must ensure that enough disk space is available at this path on the Oracle VM Server, either by mounting an additional disk at this path, or by creating a symbolic link for this path to point to an alternate location with plenty of available disk space.
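
Before enabling this facility, you may want to check how much space is free at the dump location. The sketch below uses the POSIX output format of df; the function name dump_space_kb is hypothetical.

```shell
#!/bin/sh
# Sketch: print the free space, in kilobytes, on the file system that
# holds the dump directory. With "df -Pk", the fourth column of the
# second output line is the available space. The function name is
# hypothetical.
dump_space_kb() {
    dir="${1:-/var/xen/dump}"
    df -Pk "$dir" | awk 'NR == 2 { print $4 }'
}
```

Compare the result with the memory size of the virtual machines that you expect to dump, since a core dump file can be large.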

By default, this facility is disabled at a system-wide level. It is possible to change this behavior by editing /etc/xen/xend-config.sxp directly and changing the lines:

# Whether to enable core-dumps when domains crash.
#(enable-dump no)

to:

# Whether to enable core-dumps when domains crash.
(enable-dump yes)

After making this change, you must reboot the Oracle VM Server for the change to take effect. Manually editing the global Xen configuration parameters on an Oracle VM Server is not supported by Oracle.
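
To see whether the system-wide setting is currently enabled without opening the file in an editor, a read-only check such as the following could be used. This is a sketch under the assumption that the setting appears on a line of its own, as in the example above; the function name dump_enabled is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if an uncommented "(enable-dump yes)" line is
# present in the Xen daemon configuration file. Read-only; the
# function name is hypothetical.
dump_enabled() {
    conf="${1:-/etc/xen/xend-config.sxp}"
    grep -Eq '^[[:space:]]*\(enable-dump[[:space:]]+yes\)' "$conf"
}
```

For example: dump_enabled && echo "system-wide core dumps are enabled".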

It is possible to override the system-wide behavior by setting this parameter directly in the vm.cfg for each individual virtual machine. This is the preferred approach to generating dump files, as it allows you to limit core dumps to only those virtual machines that you are interested in debugging. You can control this configuration option for each virtual machine from within Oracle VM Manager by configuring the Restart Action On Crash option for the virtual machine. See the Servers and VMs Tab section in the Oracle VM Manager Online Help for more information on this parameter.

If you change the Restart Action On Crash option for a virtual machine, you must stop the virtual machine and then start it again before the change takes effect. This is different to restarting the virtual machine, as the vm.cfg configuration file for the virtual machine is only read by the Xen hypervisor when the virtual machine is started. If you have made the configuration change but have not properly restarted the virtual machine, a crash and reboot does not automatically cause the configuration option to take effect.

To test whether or not the core dump facility is working properly for a virtual machine, you may be able to directly trigger a crash by logging into the virtual machine and obtaining root privileges before issuing the following command:

# echo c >/proc/sysrq-trigger

This command assumes that the operating system on the virtual machine is Linux-based and that the System Request trigger is enabled within the kernel. After you have triggered the crash, check /var/xen/dump on the Oracle VM Server where the virtual machine was running to view the dump file.

12.4.14 Tuning a Linux-based Virtual Machine for Handling Storage Migration

When a virtual machine is hosted in a repository using local storage on the Oracle VM Server where it is running, migration of that virtual machine to another Oracle VM Server and repository requires that I/O on affected disks is not excessively high. If you are running an application that has high I/O during a migration, it may cause the guest or the application to hang. You can mitigate this on guests that are running a Linux operating system by tuning virtual memory caching parameters within the guest kernel and by reducing the ext4 journal commit interval on any guests that are running file systems formatted with ext4.

Tuning virtual memory caching.  On the guest command line, as the root user, you can tune the cache by using the sysctl command to set a number of kernel parameters. Oracle recommends that you reduce the cache size to 5% of the system memory (the default value is 10%) and reduce the time that a memory page can remain dirty until it is flushed to around 20 seconds (the default is 30 seconds). You can do this temporarily by running the following commands:

# sysctl -w vm.dirty_background_ratio=5
# sysctl -w vm.dirty_expire_centisecs=2000

Alternatively edit /etc/sysctl.conf and add these lines:

vm.dirty_background_ratio=5
vm.dirty_expire_centisecs=2000

When you have done this, you can load these values into the kernel by running sysctl -p.
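
Note that vm.dirty_expire_centisecs is expressed in hundredths of a second, which is why 20 seconds is written as 2000. The following sketch performs that conversion and prints the two lines for /etc/sysctl.conf; the function name dirty_sysctl_lines is hypothetical.

```shell
#!/bin/sh
# Sketch: print sysctl.conf lines for a given background ratio
# (percent) and dirty-page expiry (seconds). The expiry is converted
# to centiseconds by multiplying by 100. The function name is
# hypothetical.
dirty_sysctl_lines() {
    ratio="$1"
    expire_secs="$2"
    printf 'vm.dirty_background_ratio=%d\n' "$ratio"
    printf 'vm.dirty_expire_centisecs=%d\n' "$((expire_secs * 100))"
}
```

Calling dirty_sysctl_lines 5 20 prints the same two lines as shown above.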

Tuning ext4 journaling.  If the guest is using any filesystems that are formatted to use ext4, the journaling commit process may be affected by a migration. To protect against this, decrease the amount of time between journal commits and ensure that commits are performed asynchronously. To do this, you should tune your mount parameters for any ext4 filesystem that you have mounted within the guest. For example when mounting an ext4 formatted filesystem you might use the following options:

# mount -o commit=5,journal_async_commit /dev/xvdd /vdisk3

To perform this effectively for all ext4 mounts, you may need to edit your /etc/fstab.
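
As an illustration of the /etc/fstab change, the sketch below appends the two mount options to every ext4 entry in fstab-format text. It assumes the usual field order (device, mount point, type, options, dump, pass), reads from standard input, and writes to standard output, so it does not modify /etc/fstab itself. The function name add_ext4_commit_opts is hypothetical.

```shell
#!/bin/sh
# Sketch: append the journal options to every ext4 entry in
# fstab-format text read from standard input. Comment lines, non-ext4
# entries, and entries that already set commit= pass through
# unchanged. The function name is hypothetical.
add_ext4_commit_opts() {
    awk '$1 !~ /^#/ && $3 == "ext4" && $4 !~ /commit=/ {
             $4 = $4 ",commit=5,journal_async_commit"
         }
         { print }'
}
```

For example, add_ext4_commit_opts < /etc/fstab prints the proposed file for review. Modified lines are rewritten with single spaces between fields.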