20.3 Troubleshooting OCFS2

20.3.1 Recommended Tools for Debugging
20.3.2 Mounting the debugfs File System
20.3.3 Configuring OCFS2 Tracing
20.3.4 Debugging File System Locks
20.3.5 Configuring the Behavior of Fenced Nodes

The following sections describe some techniques that you can use to investigate problems that you encounter with OCFS2.

20.3.1 Recommended Tools for Debugging

If you want to capture an oops trace, it is recommended that you set up netconsole on the nodes.
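
For example, the following sketch loads the netconsole module so that kernel messages are sent over the network to a remote logging host. The interface name, ports, IP addresses, and MAC address are placeholders for illustration; substitute values for your own network, and run a syslog daemon or netcat UDP listener on the remote host to record the messages:

# modprobe netconsole \
  netconsole=6665@192.0.2.1/eth1,6666@192.0.2.2/00:11:22:33:44:55

The first address identifies the local (sending) node and its interface; the second identifies the remote logging host.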

If you want to capture the DLM's network traffic between the nodes, you can use tcpdump. For example, to capture TCP traffic on port 7777 for the private network interface eth1, you could use a command such as the following:

# tcpdump -i eth1 -C 10 -W 15 -s 10000 -Sw /tmp/`hostname -s`_tcpdump.log \
  -ttt 'port 7777' &
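
The -C and -W options rotate the capture across several files, to which tcpdump appends a numeric suffix. To review a capture later, replay it with the -r option. The suffix in the following sketch is illustrative; the exact file names depend on your tcpdump version and rotation settings:

# tcpdump -nr /tmp/`hostname -s`_tcpdump.log00 'port 7777'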

You can use the debugfs.ocfs2 command, which is similar in behavior to the debugfs command for the ext3 file system, to trace events in the OCFS2 driver, determine lock statuses, walk directory structures, examine inodes, and so on.
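
For example, the -R (--request) option runs a single debugfs.ocfs2 command non-interactively. The following sketch displays the superblock statistics of an OCFS2 volume on /dev/sda2; substitute your own device name:

# debugfs.ocfs2 -R "stats" /dev/sda2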

For more information, see the debugfs.ocfs2(8) manual page.

The o2image command saves an OCFS2 file system's metadata (including information about inodes, file names, and directory names) to an image file on another file system. As the image file contains only metadata, it is much smaller than the original file system. You can use debugfs.ocfs2 to open the image file, and analyze the file system layout to determine the cause of a file system corruption or performance problem.

For example, the following command creates the image /tmp/sda2.img from the OCFS2 file system on the device /dev/sda2:

# o2image /dev/sda2 /tmp/sda2.img
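
You can then examine the image on any system where debugfs.ocfs2 is installed. The following sketch assumes that your version of debugfs.ocfs2 supports the -i option for reading o2image files (check the debugfs.ocfs2(8) manual page):

# debugfs.ocfs2 -i -R "stats" /tmp/sda2.img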

For more information, see the o2image(8) manual page.

20.3.2 Mounting the debugfs File System

OCFS2 uses the debugfs file system to allow access from user space to information about its in-kernel state. You must mount the debugfs file system to be able to use the debugfs.ocfs2 command.

To mount the debugfs file system, add the following line to /etc/fstab:

debugfs    /sys/kernel/debug      debugfs  defaults  0 0

and run the mount -a command.
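
To confirm that debugfs is mounted before you run debugfs.ocfs2, you can list the mounted debugfs file systems (the mount options shown in the output vary with the kernel and mount versions):

# mount -t debugfs
debugfs on /sys/kernel/debug type debugfs (rw)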

20.3.3 Configuring OCFS2 Tracing

The following table shows some of the commands that are useful for tracing problems in OCFS2.

Command                                                 Description

debugfs.ocfs2 -l                                        List all trace bits and their statuses.

debugfs.ocfs2 -l SUPER allow                            Enable tracing for the superblock.

debugfs.ocfs2 -l SUPER off                              Disable tracing for the superblock.

debugfs.ocfs2 -l SUPER deny                             Disallow tracing for the superblock, even if
                                                        implicitly enabled by another tracing mode setting.

debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow             Enable heartbeat tracing.

debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny          Disable heartbeat tracing. ENTRY and EXIT are set
                                                        to deny as they exist in all trace paths.

debugfs.ocfs2 -l ENTRY EXIT NAMEI INODE allow           Enable tracing for the file system.

debugfs.ocfs2 -l ENTRY EXIT deny NAMEI INODE allow      Disable tracing for the file system.

debugfs.ocfs2 -l ENTRY EXIT DLM DLM_THREAD allow        Enable tracing for the DLM.

debugfs.ocfs2 -l ENTRY EXIT deny DLM DLM_THREAD allow   Disable tracing for the DLM.

One method for obtaining a trace is to enable the trace, sleep for a short while, and then disable the trace. As shown in the following example, to avoid unnecessary output, you should reset the trace bits to their default settings after you have finished.

# debugfs.ocfs2 -l ENTRY EXIT NAMEI INODE allow && sleep 10 && \
  debugfs.ocfs2 -l ENTRY EXIT deny NAMEI INODE off 

To limit the amount of information displayed, enable only the trace bits that you believe are relevant to understanding the problem.

If you believe that a specific file system command, such as mv, is causing an error, you can use commands such as the following to help you trace the error.

# debugfs.ocfs2 -l ENTRY EXIT NAMEI INODE allow
# mv source destination & CMD_PID=$(jobs -p %-)
# echo $CMD_PID
# debugfs.ocfs2 -l ENTRY EXIT deny NAMEI INODE off 

As the trace is enabled for all mounted OCFS2 volumes, knowing the correct process ID can help you to interpret the trace.
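
The trace messages are written to the kernel log. Assuming that each trace line includes the process ID (the exact message format depends on your kernel version), you can use the recorded value of CMD_PID to pick out the messages that relate to the mv command, for example:

# dmesg | grep "$CMD_PID"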

For more information, see the debugfs.ocfs2(8) manual page.

20.3.4 Debugging File System Locks

If an OCFS2 volume hangs, you can use the following steps to help you determine which locks are busy and the processes that are likely to be holding the locks.

  1. Mount the debug file system.

    # mount -t debugfs debugfs /sys/kernel/debug
  2. Dump the lock statuses for the file system device (/dev/sdx1 in this example).

    # echo "fs_locks" | debugfs.ocfs2 /dev/sdx1 >/tmp/fslocks 62
    Lockres: M00000000000006672078b84822 Mode: Protected Read
    Flags: Initialized Attached
    RO Holders: 0 EX Holders: 0
    Pending Action: None Pending Unlock Action: None
    Requested Mode: Protected Read Blocking Mode: Invalid

    The Lockres field is the lock name used by the DLM. The lock name is a combination of a lock-type identifier, an inode number, and a generation number. The following table shows the possible lock types.

    Identifier    Lock Type

    D             File data.
    M             Metadata.
    R             Rename.
    S             Superblock.
    W             Read-write.

  3. Use the Lockres value to obtain the inode number and generation number for the lock.

    # echo "stat <M00000000000006672078b84822>" | debugfs.ocfs2 -n /dev/sdx1
    Inode: 419616   Mode: 0666   Generation: 2025343010 (0x78b84822)
    ... 
  4. Determine the file system object to which the inode number relates by using the following command.

    # echo "locate <419616>" | debugfs.ocfs2 -n /dev/sdx1
    419616 /linux-2.6.15/arch/i386/kernel/semaphore.c
  5. Obtain the lock names that are associated with the file system object.

    # echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c" | \
      debugfs.ocfs2 -n /dev/sdx1
    M00000000000006672078b84822 D00000000000006672078b84822 W00000000000006672078b84822  

    In this example, a metadata lock, a file data lock, and a read-write lock are associated with the file system object.

  6. Determine the DLM domain of the file system.

    # echo "stats" | debugfs.ocfs2 -n /dev/sdX1 | grep UUID: | while read a b ; do echo $b ; done
    82DA8137A49A47E4B187F74E09FBBB4B  
  7. Use the values of the DLM domain and the lock name with the following command, which enables debugging for the DLM.

    # echo R 82DA8137A49A47E4B187F74E09FBBB4B \
      M00000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug  
  8. Examine the debug messages.

    # dmesg | tail
    struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=3, key=965960985
      lockres: M00000000000006672078b84822, owner=1, state=0 last used: 0, 
      on purge list: no granted queue:
          type=3, conv=-1, node=3, cookie=11673330234144325711, ast=(empty=y,pend=n), 
          bast=(empty=y,pend=n) 
        converting queue:
        blocked queue:  

    The DLM supports 3 lock modes: no lock (type=0), protected read (type=3), and exclusive (type=5). In this example, the lock is mastered by node 1 (owner=1) and node 3 has been granted a protected-read lock on the file-system resource.

  9. Run the following command, and look for processes that are in an uninterruptible sleep state, as shown by the D flag in the STAT column.

    # ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

    At least one of the processes that are in the uninterruptible sleep state will be responsible for the hang on the other node. You can narrow down the listing as shown in the example below.
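
    For example, the following command (a sketch; the awk filter is illustrative) keeps the header line and any processes whose STAT field contains D:

    # ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | awk 'NR==1 || $2 ~ /D/'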

If a process is waiting for I/O to complete, the problem could be anywhere in the I/O subsystem, from the block device layer through the drivers to the disk array. If the hang concerns a user lock (flock()), the problem could lie in the application. If possible, kill the holder of the lock. If the hang is due to a lack of memory or fragmented memory, you can free up memory by killing non-essential processes. The most immediate solution is to reset the node that is holding the lock. The DLM recovery process can then clear all the locks that the dead node owned, allowing the cluster to continue operating.

20.3.5 Configuring the Behavior of Fenced Nodes

If a node with a mounted OCFS2 volume believes that it is no longer in contact with the other cluster nodes, it removes itself from the cluster in a process termed fencing. Fencing prevents other nodes from hanging when they try to access resources held by the fenced node. By default, a fenced node restarts instead of panicking so that it can quickly rejoin the cluster.

Under some circumstances, you might want a fenced node to panic instead of restarting. For example, you might want to use netconsole to view the oops stack trace or to diagnose the cause of frequent reboots. To configure a node to panic when it next fences, run the following command on the node after the cluster starts:

# echo panic > /sys/kernel/config/cluster/cluster_name/fence_method

where cluster_name is the name of the cluster. To restore the default behavior, use the value reset instead of panic. To set the value after each reboot of the system, add this line to /etc/rc.local, as shown in the example below.
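
For example, assuming that the cluster is named mycluster (a placeholder; use the name that is defined in your cluster configuration), you could append the following line to /etc/rc.local:

echo panic > /sys/kernel/config/cluster/mycluster/fence_method  # mycluster is a placeholder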