1.4 Diagnostic Tools for Oracle VM Server

As an optional post-installation step, Oracle recommends that you also install and configure diagnostic tools on all Oracle VM Servers. These tools can be used to help debug and diagnose issues such as system crashes, hanging, unscheduled reboots, and OCFS2 cluster errors. The output from these tools can be used by Oracle Support and can significantly improve resolution and response times.

Obtaining a system memory dump (vmcore) can be very useful when attempting to diagnose and resolve the root cause of an issue. To get a useful vmcore dump, the kdump service must be configured. See Section 1.4.2, “Manually Configuring kdump for Oracle VM Server” below for more information.

In addition, you can install netconsole, a utility allowing system console messages to be redirected across the network to another server. See the Oracle Support Document, How to Configure "netconsole" for Oracle VM Server 3.0, for information on how to install netconsole.

https://support.oracle.com/

Additional information on using diagnostic tools is provided in the Oracle Linux documentation. See the chapter titled Support Diagnostic Tools in the Oracle Linux Administrator's Solutions Guide.

http://docs.oracle.com/cd/E37670_01/E37355/html/ol_diag.html

1.4.1 Working with the OSWatcher Utility on Oracle VM Server

OSWatcher (oswbb) is a collection of shell scripts that collect and archive operating system and network metrics to diagnose performance issues with Oracle VM Server. OSWatcher operates as a set of background processes to gather data with standard UNIX utilities such as vmstat, netstat and iostat.

By default, OSWatcher is installed on Oracle VM Server and is enabled to run at boot. The following table describes the OSWatcher program and main configuration file:

Name: /usr/sbin/OSWatcher

Description: The main OSWatcher program. If required, you can configure certain parameters for statistics collection. However, you should do so only if Oracle Support advises you to change the default configuration.

Name: /etc/sysconfig/oswatcher

Description: This file defines the directory where OSWatcher log files are saved, the interval between statistics collection, and the maximum amount of time to retain archived statistics.

Important

It is not possible to specify a limit to the data that the OSWatcher utility collects. For this reason, you should be careful when modifying the default configuration so that the OSWatcher utility does not use all available space on the system disk.
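To illustrate why the retention setting matters, the following sketch lists archive files older than seven days, the kind of cleanup that the retention setting automates. The archive path here is a temporary mock directory so the example can run anywhere; on a real server, take the archive location from /etc/sysconfig/oswatcher.

```shell
# Sketch: find OSWatcher archive files older than 7 days -- the kind of
# cleanup the retention setting automates. ARCHIVE is a temporary mock
# directory here; take the real archive location from
# /etc/sysconfig/oswatcher on your server.
ARCHIVE=$(mktemp -d)
touch -d '10 days ago' "$ARCHIVE/oswvmstat_old.dat"   # stale archive file
touch "$ARCHIVE/oswvmstat_new.dat"                    # recent archive file
stale=$(find "$ARCHIVE" -type f -mtime +7 -name '*.dat')
echo "$stale"
rm -rf "$ARCHIVE"
```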

To start, stop, and check the status of OSWatcher, use the following command:

# service oswatcher {start|stop|status|restart|reload|condrestart}

For detailed information on the data that OSWatcher collects and how to analyze the output, as well as for instructions on sending the data to Oracle Support, see the OSWatcher User Guide in the following directory on Oracle VM Server: /usr/share/doc/oswatcher-x.x.x/

1.4.2 Manually Configuring kdump for Oracle VM Server

While Oracle VM Server uses the robust and fault-tolerant UEK4 kernel, which should rarely encounter errors that crash the entire system, it is still possible for a system-wide error to result in a kernel crash. Information about the actual state of the system at the time of a kernel crash is critical to accurately debug issues and resolve them. The kdump service is used to capture the memory dump from dom0 and store it on the filesystem. The service does not dump any system memory used by guest virtual machines, so the memory dump is specific to dom0 and the Xen hypervisor itself. The memory dump file that kdump generates is referred to as the vmcore file.

This section describes how to manually configure Oracle VM Server so that the kdump service is properly enabled and running, allowing you to set up and enable this service after an installation. The Oracle VM Server installer also provides an option to enable kdump at installation time, in which case many of these steps are performed automatically. See Kdump Setting in the Oracle VM Installation and Upgrade Guide for more information on this.

Checking Pre-requisite Packages

By default, the packages required to enable the kdump service are included in the Oracle VM Server installation, but it is good practice to check that they are installed before continuing with any configuration work. You can do this by running the following command:

# rpm -qa | grep kexec-tools

If the kexec-tools package is not installed, you must install it manually.

Updating the GRUB2 Configuration

Oracle VM Server makes use of GRUB2 to handle the boot process. In this step, you must configure GRUB2 to pass the crashkernel parameter to the Xen kernel at boot. This can be done by editing the /etc/default/grub file and modifying the GRUB_CMDLINE_XEN variable by appending the appropriate crashkernel parameter.

The crashkernel parameter specifies the amount of memory to reserve for loading the crash kernel that generates the dump file, as well as the offset at which the crash kernel region begins in memory. The minimum amount of RAM that may be specified for a crash kernel is 512 MB, and this should be offset by 64 MB. This results in a configuration similar to the following:

GRUB_CMDLINE_XEN="dom0_mem=max:6144M allowsuperpage dom0_vcpus_pin \
dom0_max_vcpus=20 crashkernel=512M@64M"

This setting is sufficient for the vast majority of systems; however, on systems that use a significant number of large drivers, the crash kernel may need to be allocated more space in memory. If you force a dump and it fails to generate a core file, you may need to increase the amount of memory allocated to the crash kernel.

Important

While UEK4 supports the crashkernel=auto option, the Xen hypervisor does not. You must specify values for the RAM reservation and offset used for the crash kernel or the kdump service is unable to run.

When you have finished modifying /etc/default/grub, you must rebuild the system GRUB2 configuration that is used at boot time. This is done by running:

# grub2-mkconfig -o /boot/grub2/grub.cfg
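The edit to /etc/default/grub can also be scripted. The following sketch appends the crashkernel parameter to the GRUB_CMDLINE_XEN line; it is demonstrated against a sample file so it can be run safely anywhere. On a real server, apply the same sed expression to a copy of /etc/default/grub, review the result, and then rebuild grub.cfg with grub2-mkconfig as shown above.

```shell
# Sketch: append crashkernel=512M@64M to GRUB_CMDLINE_XEN. Demonstrated
# against a sample file; on a real server, run the same sed expression on
# a copy of /etc/default/grub and review it before rebuilding grub.cfg.
sample=$(mktemp)
cat > "$sample" <<'EOF'
GRUB_CMDLINE_XEN="dom0_mem=max:6144M allowsuperpage dom0_vcpus_pin dom0_max_vcpus=20"
EOF
# Only modify the line if crashkernel= is not already present.
grep -q 'crashkernel=' "$sample" || \
  sed -i '/^GRUB_CMDLINE_XEN=/ s/"$/ crashkernel=512M@64M"/' "$sample"
result=$(cat "$sample")
echo "$result"
rm -f "$sample"
```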

Optionally Preparing a Local Filesystem to Store Dump Files

Kdump is able to store vmcore files in a variety of locations, including network accessible filesystems. By default, vmcore files are stored in /var/crash/, but this may not be appropriate depending on your disk partitioning and available space. The filesystem where the vmcore files are stored must have enough space to match the amount of memory available to Oracle VM Server for each dump.
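A quick way to sanity-check the space requirement is to compare total system memory against the free space at the intended dump location. In the sketch below, DUMPDIR defaults to /var purely so the example runs anywhere; on a real server, point it at /var/crash or at your mounted dump partition.

```shell
# Sketch: confirm a full memory dump would fit at the dump location.
# DUMPDIR defaults to /var here only so the example runs anywhere;
# point it at /var/crash or at your mounted dump partition.
DUMPDIR=${DUMPDIR:-/var}
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
free_kb=$(df -Pk "$DUMPDIR" | awk 'NR==2 {print $4}')
echo "memory: ${mem_kb} kB, free at ${DUMPDIR}: ${free_kb} kB"
if [ "$free_kb" -ge "$mem_kb" ]; then
  echo "enough space for a full vmcore"
else
  echo "insufficient space for a full vmcore"
fi
```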

Since the installation of Oracle VM Server only uses as much disk space as is required, a 'spare' partition is frequently available on a new installation. This partition is left available for hosting a local repository, or for alternate uses such as storing the vmcore files generated by kdump. If you opt to use it for this purpose, you must first correctly identify and take note of the UUID of the partition and then format it with a usable filesystem.

The following steps serve as an illustration of how you might prepare the local spare partition.

  • Identify the partition that the installer left 'spare' after the installation. This is usually listed under /dev/mapper with a device name that starts with OVM_SYS_REPO_PART. If you can identify this device, you can format it with an ext4 filesystem:

    # mkfs.ext4 /dev/mapper/OVM_SYS_REPO_PART_VBd64a21cf-db4a5ad5

    If you don't have a partition mapped like this, you may need to use utilities such as lsblk, parted, fdisk or gdisk to identify any free partitions on your available disk devices.

  • Obtain the UUID for the filesystem. You can do this by running the blkid command:

    # blkid /dev/mapper/OVM_SYS_REPO_PART_VBd64a21cf-db4a5ad5
    /dev/mapper/OVM_SYS_REPO_PART_VBd64a21cf-db4a5ad5: 
    UUID="51216552-2807-4f17-ab27-b8135f69896d" TYPE="ext4"

    Take note of the UUID as you will need to use this later when you configure kdump.
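If you are scripting the setup, the UUID can be extracted from the blkid output programmatically. The sed parser below is one way to pull the value out, demonstrated against the sample output shown above; on a live system you can pipe blkid itself into the same parser.

```shell
# Sketch: extract the filesystem UUID from a line of blkid output.
# The sample line below is the output shown above; on a live system,
# pipe the blkid command into the same parser.
parse_uuid() {
  sed -n 's/.*UUID="\([^"]*\)".*/\1/p'
}
sample='/dev/mapper/OVM_SYS_REPO_PART_VBd64a21cf-db4a5ad5: UUID="51216552-2807-4f17-ab27-b8135f69896d" TYPE="ext4"'
uuid=$(printf '%s\n' "$sample" | parse_uuid)
echo "$uuid"   # 51216552-2807-4f17-ab27-b8135f69896d
```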

Modifying the kdump Configuration

System configuration directing how the kdump service runs is defined in /etc/sysconfig/kdump, while specific kdump configuration variables are defined in /etc/kdump.conf. Changes may need to be made to either of these files depending on your environment. However, the default configuration should be sufficient to run kdump initially without any problems. The following list identifies potential configuration changes that you may wish to make:

  • On systems with large amounts of memory (for example, over 1 TB), it is advisable to disable the I/O Memory Management Unit (IOMMU) within the crash kernel for performance and stability reasons. This is achieved by editing /etc/sysconfig/kdump and appending the iommu=off kernel boot parameter to the KDUMP_COMMANDLINE_APPEND variable:

    KDUMP_COMMANDLINE_APPEND="irqpoll maxcpus=1 nr_cpus=1 reset_devices cgroup_disable=memory
          mce=off selinux=0 iommu=off"
  • If you intend to change the partition where the vmcore files are stored, for instance to use the spare partition left on the server after installation, you must edit /etc/kdump.conf to provide the filesystem type and device location of the partition. If you followed the instructions above, it is preferable to do this by specifying the UUID that you obtained for the partition using the blkid command. A line similar to the following should appear in the configuration:

    ext4 UUID=51216552-2807-4f17-ab27-b8135f69896d
  • You may edit the default path where vmcore files are stored, but note that this path is relative to the partition that kdump is configured to use to store vmcores. If you have configured kdump to store vmcores on a separate filesystem, when you mount the filesystem, the vmcore files are located in the path specified by this directive on the mounted filesystem:

    path /var/crash
  • If you are having issues obtaining a vmcore, or you find that your vmcore files are particularly large when using the makedumpfile utility, you may reconfigure kdump to use the cp command to copy the vmcore in sparse mode. To do this, edit /etc/kdump.conf to comment out the line that sets core_collector to use the makedumpfile utility, and uncomment the lines that enable the cp command:

    # core_collector makedumpfile -EXd 1 --message-level 1 --non-cyclic
    core_collector cp --sparse=always 
    extra_bins /bin/cp

    Your mileage with this may vary, and the makedumpfile utility is generally recommended instead.
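Pulling the choices above together, the following sketch writes an illustrative kdump.conf to a temporary file and filters it down to the active (non-comment) directives. The directive values are the example settings from this section; in particular, the UUID is the sample value from earlier and must be replaced with your own.

```shell
# Sketch: an illustrative kdump.conf combining the choices above, written
# to a temporary copy. The UUID is the example value from earlier and
# must be replaced with your own.
tmpconf=$(mktemp)
cat > "$tmpconf" <<'EOF'
# store vmcores on the spare partition identified earlier
ext4 UUID=51216552-2807-4f17-ab27-b8135f69896d
path /var/crash
core_collector makedumpfile -EXd 1 --message-level 1 --non-cyclic
EOF
# Show only the active (non-comment, non-blank) directives.
active=$(grep -Ev '^[[:space:]]*(#|$)' "$tmpconf")
echo "$active"
rm -f "$tmpconf"
```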

Enabling the kdump Service

You can enable the kdump service to run at every boot by running the following command:

# chkconfig kdump on

You must restart the kdump service at this point so that it detects the changes made to the kdump configuration and determines whether the kdump crash kernel image has been generated and is up to date. If the kernel image needs to be updated, kdump does this automatically; otherwise, it restarts without attempting to rebuild the crash kernel image:

# service kdump restart 
Stopping kdump:                   [  OK  ]
 Detected change(s) in the following file(s): 
    /etc/kdump.conf 
Rebuilding /boot/initrd-4.1.12-25.el6uek.x86_64kdump.img 

Starting kdump:                   [  OK  ]

Confirming that kdump is Configured and Working Correctly

You can confirm that the kernel loaded for dom0 is correctly configured by running the following command and checking that the output shows your crashkernel parameter is in use:

# xl dmesg|grep -i crashkernel
(XEN) Command line: placeholder dom0_mem=max:6144M allowsuperpage dom0_vcpus_pin
        dom0_max_vcpus=20 crashkernel=512M@64M

You can also check that the appropriate amount of memory is reserved for kdump by running the following:

# xl dmesg|grep -i kdump 
(XEN) Kdump: 512MB (524288kB) at 0x4000000

or alternatively:

# kexec --print-ckr-size 
536870912
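The figure that kexec prints is the reservation in bytes; you can confirm that it matches the crashkernel= setting with simple shell arithmetic:

```shell
# 512M expressed in bytes should match the value kexec reports.
echo $((512 * 1024 * 1024))   # 536870912
```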

You can check that the kdump service is running by checking the service status:

# service kdump status
Kdump is operational 

If there are no errors in /var/log/messages or on the console, you can assume that kdump is running correctly.

To test that kdump is able to generate a vmcore and store it correctly, you can trigger a kernel panic by issuing the following commands:

# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger
Note

These commands cause the kernel on the Oracle VM Server to panic and crash. If kdump is working correctly, the crash kernel should take over and generate the vmcore file which is copied to the configured location before the server reboots automatically. If kdump fails to load the crash kernel, the server may hang with the kernel panic and requires a hard-reset to reboot.

After you have triggered a kernel panic and the system has successfully rebooted, you may check that the vmcore file was properly generated:

  • If you have not configured kdump to use an alternate partition, you should be able to locate the vmcore file in /var/crash/127.0.0.1-date-time/vmcore, where date and time represent the date and time when the vmcore was generated.

  • If you configured kdump to use an alternate partition to store the vmcore file, you must mount it first. If you used the spare partition generated by a fresh installation of Oracle VM Server, this can be done in the following way:

    # mount /dev/mapper/OVM_SYS_REPO_PART_VBd64a21cf-db4a5ad5 /mnt

    You may then find the vmcore file in /mnt/var/crash/127.0.0.1-date-time/vmcore, where date and time represent the date and time when the vmcore was generated, for example:

    # file /mnt/var/crash/127.0.0.1-2015-12-08-16\:12\:28/vmcore
    /mnt/var/crash/127.0.0.1-2015-12-08-16:12:28/vmcore: ELF 64-bit LSB
          core file x86-64, version 1 (SYSV), SVR4-style

    Remember to unmount the partition after you have obtained the vmcore file for analysis, so that it is free for use by kdump.
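Locating the most recent dump can also be scripted. The sketch below is demonstrated against a mock directory layout so it can run anywhere; on a real server, point CRASHDIR at /var/crash, or at the mount point of your dump partition.

```shell
# Sketch: find the most recent vmcore under the crash directory.
# CRASHDIR is a temporary mock layout here; on a real server, point it at
# /var/crash or at the mount point of your dump partition.
CRASHDIR=$(mktemp -d)
mkdir -p "$CRASHDIR/127.0.0.1-2015-12-08-16:12:28"
touch "$CRASHDIR/127.0.0.1-2015-12-08-16:12:28/vmcore"
newest=$(find "$CRASHDIR" -name vmcore | sort | tail -1)
echo "$newest"
rm -rf "$CRASHDIR"
```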

If you find that a vmcore file is not being created or that the system hangs without automatically rebooting, you may need to adjust your configuration. The most common problem is that there is insufficient memory allocated for the crash kernel to run and complete its operations. Your starting point to resolving issues with kdump is always to try increasing the reserved memory that is specified in your GRUB2 configuration.