3 Managing the Oracle Cluster File System Version 2 in Oracle Linux

This chapter includes information about managing the Oracle Cluster File System Version 2 (OCFS2) in Oracle Linux 8. It includes tasks for configuring, administering, and troubleshooting OCFS2.

In Oracle Linux 8, Oracle Cluster File System Version 2 (OCFS2) is supported on Unbreakable Enterprise Kernel (UEK) releases only, starting with Unbreakable Enterprise Kernel Release 6 (UEK R6).

For information about local file system management in Oracle Linux, see Oracle Linux 8: Managing Local File Systems.

About OCFS2

OCFS2 (Oracle Cluster File System Version 2) is a general-purpose shared-disk file system that is intended for use with clusters. OCFS2 offers high performance as well as high availability. Optionally, you can also mount an OCFS2 volume on a standalone, non-clustered system.

Using OCFS2 offers the following benefits:

  • You can use the reflink command with OCFS2 to create copy-on-write clones of individual files. You can also use the cp --reflink command in the same way that you would on a Btrfs file system; see the example after this list. Typically, such clones enable you to save disk space when storing multiple copies of very similar files, such as virtual machine (VM) images or Linux Containers.

    Note that when using the reflink command, the resulting file system behaves like a clone of the original file system, which means that their UUIDs are identical. When using the reflink command to create a clone, you must change the UUID by using the tunefs.ocfs2 command. See Querying and Changing Volume Parameters.

  • Mounting a local OCFS2 file system enables you to subsequently migrate the file system to a cluster file system without requiring any conversion.
  • OCFS2 provides local file-system semantics. Therefore, almost all applications can use OCFS2. Applications that are cluster-aware can use cache-coherent parallel I/O from multiple cluster nodes to balance activity across the cluster, or they can use the available file-system functionality to fail over and run on another node in the event that a node fails.
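
For example, assuming a hypothetical virtual machine image stored on an OCFS2 volume that is mounted at /ocfs2 (the file names are illustrative only), either of the following commands creates a copy-on-write clone of the file:

# Clone the file by using the OCFS2 reflink utility
reflink /ocfs2/images/ol8.img /ocfs2/images/ol8-clone.img

# Clone the file by using the coreutils cp command
cp --reflink /ocfs2/images/ol8.img /ocfs2/images/ol8-clone2.img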

OCFS2 Use Cases

The following are some typical use cases for OCFS2.

Load Balancing Use Case

You can use OCFS2 nodes to share resources between client systems. For example, the nodes could export a shared file system by using Samba or NFS. To distribute service requests between the nodes, you can use round-robin DNS or a network load balancer, or you can specify which node each client should use.

Oracle Real Application Cluster Use Case

Oracle Real Application Cluster (RAC) uses its own cluster stack, Cluster Synchronization Services (CSS). You can use O2CB in conjunction with CSS, but note that each stack is configured independently for timeouts, nodes, and other cluster settings. You can use OCFS2 to host the voting disk files and the Oracle cluster registry (OCR), but not the grid infrastructure user's home, which must exist on a local file system on each node.

Because both CSS and O2CB use the lowest node number as a tie breaker in quorum calculations, ensure that the node numbers are the same in both clusters. If necessary, edit the O2CB configuration file, /etc/ocfs2/cluster.conf, to make the node numbering consistent. Then, update this file on all of the nodes. The change takes effect when the cluster is restarted.

Oracle Database Use Case

Specify the noatime option when mounting volumes that host Oracle datafiles, control files, redo logs, voting disk, and OCR. The noatime option disables unnecessary updates to the access time on the inodes.

Specify the nointr mount option to prevent signals interrupting I/O transactions that are in progress.

By default, the init.ora parameter filesystemio_options directs the database to perform direct I/O to the Oracle datafiles, control files, and redo logs. You should also specify the datavolume mount option for volumes that contain the voting disk and OCR. Do not specify this option for volumes that host the Oracle user's home directory or Oracle E-Business Suite.
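
For example, a volume that holds the voting disk and OCR might be mounted with these options as follows; the device and mount point shown here are illustrative only:

sudo mount -o noatime,nointr,datavolume /dev/sdd1 /u02/ocrdata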

To prevent database blocks from becoming fragmented across a disk, ensure that the file system cluster size is at least as large as the database block size, which is typically 8KB. If you specify the file system usage type as datafiles when using the mkfs.ocfs2 command, the file system cluster size is set to 128KB.

To enable multiple nodes to maximize throughput by concurrently streaming data to an Oracle datafile, OCFS2 deviates from the POSIX standard by not updating the modification time (mtime) on the disk when performing non-extending direct I/O writes. The value of mtime is updated in memory. However, OCFS2 does not write the value to disk unless an application extends or truncates the file or performs an operation to change the file metadata, such as using the touch command. As a result, different nodes can report different time stamps for the same file. Use the following command to view the on-disk timestamp of a file:

sudo debugfs.ocfs2 -R "stat /file_path" device | grep "mtime:"
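
For example, for a hypothetical datafile users01.dbf at the root of an OCFS2 volume on /dev/sdd1 that is mounted at /oradata, you could compare the in-memory and on-disk timestamps as follows:

# Modification time as reported through the mounted file system (in-memory value)
stat -c %y /oradata/users01.dbf

# Modification time as recorded on disk
sudo debugfs.ocfs2 -R "stat /users01.dbf" /dev/sdd1 | grep "mtime:"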

Setting Up an OCFS2 Cluster

A cluster consists of members called nodes. For best performance, each node in the cluster should have at least two network interfaces. The first interface is connected to a public network to allow general access to the systems, while the second interface is used for private communications between the nodes and the cluster heartbeat. This second interface determines how the cluster nodes coordinate their access to shared resources and how they monitor each other's state.

Important:

Both network interfaces must be connected through a network switch. Additionally, you must ensure that all of the network interfaces are configured and working before configuring the cluster.

Planning for an OCFS2 Cluster

Aside from setting up the proper equipment, prepare the following for the cluster:

  • Designated cluster members, specifically, their hostnames and corresponding IP addresses.
  • Heartbeat mode to use in the cluster.

In a cluster configuration, small packets travel over a specific UDP port throughout the entire setup, including all of the networks that are configured for the cluster. These packets establish routes between the cluster nodes and indicate whether the cluster network is up or down. Essentially, the packets are how the health of the cluster is determined. For this reason, these packets are also called heartbeats.

You can configure a cluster to run in either of the following heartbeat modes:

  • Local heartbeat thread for each shared device (default heartbeat mode).

    In this configuration, a node starts a heartbeat thread when the node mounts an OCFS2 volume and stops the thread when the node unmounts the volume. CPU overhead is large on nodes that mount a large number of OCFS2 volumes because each mount requires a separate heartbeat thread. Likewise, a large number of mounts increases the risk of a node being isolated from the cluster, also known as fencing, because of a heartbeat I/O timeout on a single mount.

  • Global heartbeat on specific shared devices.

    This mode enables you to configure any OCFS2 volume as a global heartbeat device, provided that the volume occupies a whole disk device and not a partition. In this mode, the heartbeat to the device starts when the cluster comes online and stops when the cluster goes offline. This mode is recommended for clusters that mount a large number of OCFS2 volumes. A node fences itself out of the cluster if a heartbeat I/O timeout occurs on more than half of the global heartbeat devices. To provide redundancy against the failure of one of the devices, you should configure at least three global heartbeat devices.

The following figure shows a cluster of four nodes that are connected by using a network switch to a LAN and a network storage server. The nodes and storage server are also connected by using a switch to a private network that is used for the local cluster heartbeat.

Figure 3-1 Cluster Configuration by Using a Private Network



Although you can configure and use OCFS2 without using a private network, such a configuration increases the probability of a node fencing itself out of the cluster due to an I/O heartbeat timeout.

The following table provides recommendations for minimum cluster size settings for different file system size ranges:

File System Size    Suggested Minimum Cluster Size
1 GB - 10 GB        8K
10 GB - 100 GB      16K
100 GB - 1 TB       32K
1 TB - 10 TB        64K
10 TB - 16 TB       128K

Installing the Cluster Software

You can configure a cluster whose nodes use mixed versions of OCFS2 and the UEK software. This configuration is supported for scenarios where you are performing a rolling update of a cluster. In these cases, the cluster node that is running the lowest version of the software determines the set of usable features. However, you should preferably use the same version of the OCFS2 software and a compatible UEK release on all of the nodes of the cluster to ensure efficient cluster operations.

For a tutorial on how to configure OCFS2, see Use Oracle Cloud Cluster File System Tools on Oracle Cloud Infrastructure.

Important:

Perform the following procedure on each designated node.
  1. Display the designated node's kernel version.
    sudo uname -r

    The output on all the nodes should display the same version, for example, 5.4.17-2136.302.7.2.1.el8uek.x86_64. If necessary, update each node to ensure that all of the nodes are running the same kernel version.

  2. Install the OCFS2 packages.
    sudo dnf install -y ocfs2-tools

    Note:

    If you want to use the global heartbeat feature, install the ocfs2-tools-1.8.0-11 or later package as well.

  3. Configure the firewall to allow access on the interface that the cluster uses for private cluster communication.

    By default, the cluster uses both TCP and UDP over port 7777. Note that the permanent rule takes effect only after you reload the firewall configuration; see the example after this procedure.

    sudo firewall-cmd --permanent --add-port=7777/tcp --add-port=7777/udp
  4. Disable SELinux.

    With a text editor, open the /etc/selinux/config file and disable SELinux on the appropriate line entry in the file, as shown in the following example. The change takes effect at the next reboot.

    # This file controls the state of SELinux on the system.
    # SELINUX= can take one of these three values:
    #     enforcing - SELinux security policy is enforced.
    #     permissive - SELinux prints warnings instead of enforcing.
    #     disabled - No SELinux policy is loaded.
    SELINUX=disabled
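
For example, to reload the firewall configuration for step 3 and, after rebooting, to confirm the SELinux state for step 4, assuming that the firewalld service is running:

# Make the permanent firewall rule active in the running configuration
sudo firewall-cmd --reload

# After the reboot, the following command should report Disabled
getenforce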

Configuring the Cluster Layout

  1. On any one of the designated cluster members, create a cluster definition.

    sudo o2cb add-cluster cluster-name
  2. Add the current system and all of the other designated nodes to the cluster.

    sudo o2cb add-node cluster-name hostname --ip ip_address

    The IP address should be the IP address that is used by the node for private communication in the cluster.

    Suppose that you want to create a cluster mycluster out of the following systems as identified by their host names. Further, you are using ol-sys1 to complete this configuration.

    • ol-sys0: 10.1.0.100
    • ol-sys1: 10.1.0.110
    • ol-sys2: 10.1.0.120

    To add nodes to mycluster, you would type the following commands on ol-sys1:

    sudo o2cb add-node mycluster ol-sys0 --ip 10.1.0.100
    sudo o2cb add-node mycluster ol-sys1 --ip 10.1.0.110
    sudo o2cb add-node mycluster ol-sys2 --ip 10.1.0.120

    Note:

    OCFS2 only supports IPv4 addresses.

  3. If you want the cluster to run in global heartbeat mode, do the following:

    1. Run the following command for each cluster heartbeat device:
      sudo o2cb add-heartbeat cluster-name device-name

      For example, for mycluster, you set the cluster heartbeat on /dev/sdd, /dev/sdg, and /dev/sdj as follows:

      sudo o2cb add-heartbeat mycluster /dev/sdd
      sudo o2cb add-heartbeat mycluster /dev/sdg
      sudo o2cb add-heartbeat mycluster /dev/sdj
    2. Set the cluster's heartbeat mode to global.
      sudo o2cb heartbeat-mode cluster-name global

    Important:

    You must configure the global heartbeat feature to use whole disk devices. Global heartbeat devices on disk partitions are not supported.

  4. Copy the cluster configuration file /etc/ocfs2/cluster.conf to each node in the cluster, for example, by using scp, as shown in the example after this procedure.

  5. (Optional) Display information about the cluster.
    sudo o2cb list-cluster cluster-name

    A 3-node cluster mycluster with a local heartbeat would display information similar to the following:

    node:
            name = ol-sys0
            cluster = mycluster
            number = 0
            ip_address = 10.1.0.100
            ip_port = 7777
    
    node:
            name = ol-sys1
            cluster = mycluster
            number = 1
            ip_address = 10.1.0.110
            ip_port = 7777
    
    node:
            name = ol-sys2
            cluster = mycluster
            number = 2
            ip_address = 10.1.0.120
            ip_port = 7777
    
    cluster:
            name = mycluster
            heartbeat_mode = local
            node_count = 3

    The same cluster but with a global heartbeat would have the following information:

    node:
            name = ol-sys0
            cluster = mycluster
            number = 0
            ip_address = 10.1.0.100
            ip_port = 7777
    
    node:
            name = ol-sys1
            cluster = mycluster
            number = 1
            ip_address = 10.1.0.110
            ip_port = 7777
    
    node:
            name = ol-sys2
            cluster = mycluster
            number = 2
            ip_address = 10.1.0.120
            ip_port = 7777
    
    cluster:
            name = mycluster
            heartbeat_mode = global
            node_count = 3
    
    heartbeat:
            cluster = mycluster
            region = 7DA5015346C245E6A41AA85E2E7EA3CF
    
    heartbeat:
            cluster = mycluster
            region = 4F9FBB0D9B6341729F21A8891B9A05BD
    
    heartbeat:
            cluster = mycluster
            region = B423C7EEE9FC426790FC411972C91CC3

    The heartbeat regions are represented by the UUIDs of their block devices.
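
For example, to copy the configuration file from ol-sys1 to the other nodes in step 4 by using scp, you might run commands similar to the following sketch, which assumes that the file is readable by your user and that the SSH user has write access to /etc/ocfs2 on the remote nodes:

for host in ol-sys0 ol-sys2; do
  scp /etc/ocfs2/cluster.conf ${host}:/etc/ocfs2/cluster.conf
done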

Configuring and Starting the O2CB Cluster Stack

Important:

Perform the following procedure on each designated node.
  1. Configure the node.

    sudo /sbin/o2cb.init configure

    The configuration process prompts you for additional information. Some of the parameters you need to set are the following:

    • Load the O2CB driver on boot: Specify y or n; n is the default setting.
    • Cluster to start at boot: Specify the name of your cluster. The name must match the cluster name in the /etc/ocfs2/cluster.conf file.
    • For the remaining parameters for which you are prompted, the default settings are typically sufficient. However, you can specify values other than the default if necessary.
  2. Verify the settings for the cluster stack.

    sudo /sbin/o2cb.init status

    A cluster that uses local heartbeat mode would display information similar to the following:

    Driver for "configfs": Loaded
    Filesystem "configfs": Mounted
    Stack glue driver: Loaded
    Stack plugin "o2cb": Loaded
    Driver for "ocfs2_dlmfs": Loaded
    Filesystem "ocfs2_dlmfs": Mounted
    Checking O2CB cluster "mycluster": Online
      Heartbeat dead threshold: 61
      Network idle timeout: 30000
      Network keepalive delay: 2000
      Network reconnect delay: 2000
      Heartbeat mode: Local
    Checking O2CB heartbeat: Active

    A cluster that uses global heartbeat mode would display information similar to the following:

    Driver for "configfs": Loaded
    Filesystem "configfs": Mounted
    Stack glue driver: Loaded
    Stack plugin "o2cb": Loaded
    Driver for "ocfs2_dlmfs": Loaded
    Filesystem "ocfs2_dlmfs": Mounted
    Checking O2CB cluster "mycluster": Online
      Heartbeat dead threshold: 61
      Network idle timeout: 30000
      Network keepalive delay: 2000
      Network reconnect delay: 2000
      Heartbeat mode: Global
    Checking O2CB heartbeat: Active
      7DA5015346C245E6A41AA85E2E7EA3CF /dev/sdd
      4F9FBB0D9B6341729F21A8891B9A05BD /dev/sdg
      B423C7EEE9FC426790FC411972C91CC3 /dev/sdj
  3. Enable the o2cb and ocfs2 services so that they start at boot time after networking is enabled.

    sudo systemctl enable o2cb
    sudo systemctl enable ocfs2
  4. Using a text editor, set the following kernel settings for cluster operations in the /etc/sysctl.d/99-sysctl.conf file:
    • kernel.panic = 30

      This setting specifies the number of seconds after a panic before a system automatically resets itself. The default value is zero, which means that the system hangs after a panic. A non-zero value enables the system to reset automatically. If you require a memory image to be created before the system resets, assign a higher value.

    • kernel.panic_on_oops = 1

      This setting enables the system to panic if a kernel oops occurs. Thus, if a kernel thread required for cluster operation crashes, the system resets itself. Otherwise, another node might not be able to tell whether a node is slow to respond or unable to respond, which eventually causes all cluster operations to hang.

  5. Apply the configuration by running the following command:
    sudo sysctl -p
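
For example, you can confirm that the settings from the previous steps are active:

sysctl kernel.panic kernel.panic_on_oops
kernel.panic = 30
kernel.panic_on_oops = 1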

The o2cb.init command accepts other subcommands to enable you to administer the cluster, such as the following:

  • /sbin/o2cb.init status: Check the status of the cluster stack.
  • /sbin/o2cb.init online: Start the cluster stack.
  • /sbin/o2cb.init offline: Stop the cluster stack.
  • /sbin/o2cb.init unload: Unload the cluster stack.

To determine other available subcommands, type the command o2cb.init by itself.
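
For example, the following sketch takes the cluster stack offline on a node and then brings it back online again; depending on the configuration, you might need to supply the cluster name as an argument:

sudo /sbin/o2cb.init offline
sudo /sbin/o2cb.init online
sudo /sbin/o2cb.init status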

Working With OCFS2 Volumes

To configure OCFS2 volumes, you use mkfs.ocfs2 as the main command. To enable you to create the volume according to specific requirements, the command accepts different options and arguments, including the following:

-b blocksize, --block-size blocksize

Specifies the unit size for I/O transactions to and from the file system, and the size of inode and extent blocks. The supported block sizes are 512 (512 bytes), 1K, 2K, and 4K. The default and recommended block size is 4K (4 kilobytes).

-C clustersize, --cluster-size clustersize

Specifies the unit size for space used to allocate file data. The supported cluster sizes are 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, and 1M (1 megabyte). The default cluster size is 4K (4 kilobytes).

--fs-feature-level=feature-level

Enables you to select a set of file-system features from the following choices:

  • default: Enables support for the sparse files, unwritten extents, and inline data features.
  • max-compat: Enables only those features that are understood by older versions of OCFS2.
  • max-features: Enables all features that OCFS2 currently supports.

--fs-features=feature

Allows you to enable or disable individual features, such as support for sparse files, unwritten extents, and backup superblocks. For more information, see the mkfs.ocfs2(8) manual page.

-J journalsize, --journal-size journalsize

Specifies the size of the write-ahead journal. If not specified, the size is determined from the file system usage type that you specify with the -T option, or, otherwise, from the volume size. The default size of the journal is 64M (64 MB) for datafiles, 256M (256 MB) for mail, and 128M (128 MB) for vmstore.

-L label, --label label

Specifies a descriptive name for the volume that allows you to identify it easily on different cluster nodes.

-N number-of-slots, --node-slots number-of-slots

Determines the maximum number of nodes that can concurrently access a volume, which is limited by the number of node slots for system files such as the file-system journal. For best performance, set the number of node slots to at least twice the number of nodes. If you subsequently increase the number of node slots, performance can suffer because the journal will no longer be contiguously laid out on the outer edge of the disk platter.

-T file-system-usage-type

Specifies the type of usage for the file system, which is one of the following three options:

  • datafiles: Database files are typically few in number, fully allocated, and relatively large. Such files require few metadata changes, and do not benefit from having a large journal.
  • mail: Mail server files are typically many in number, and relatively small. Such files require many metadata changes, and benefit from having a large journal.
  • vmstore: Virtual machine image files are typically few in number, sparsely allocated, and relatively large. Such files require a moderate number of metadata changes and a medium sized journal.

Creating and Mounting OCFS2 Volumes

When creating OCFS2 volumes, keep the following additional points in mind:

  • Do not create an OCFS2 volume on an LVM logical volume, as LVM is not cluster-aware.

  • After you have created an OCFS2 volume, you cannot change the block and cluster size of that volume. You can use the tunefs.ocfs2 command to modify other settings for the file system, with certain restrictions. For more information, see the tunefs.ocfs2(8) manual page.

  • If you intend to store database files on the volume, do not specify a cluster size that is smaller than the block size of the database.

  • The default cluster size of 4 KB is not suitable if the file system is larger than a few gigabytes.

  1. Create an OCFS2 volume.
    sudo mkfs.ocfs2 -L "myvol" /dev/sdc1

    This command creates the volume with a label on the specified device. Without additional options or arguments, the command creates a volume that uses default values for some of its properties, such as a 4 KB block and cluster size, eight node slots, a 256 MB journal, and support for default file system features. This volume with default settings is suitable for file systems that are no larger than a few gigabytes.

    Tip:

    Ensure that the device corresponds to a partition so that you can refer to the label when mounting the volume.

    The options that you use with the mkfs.ocfs2 command determine the properties of the volume that you are creating. Consider the following examples:

    • Create a labeled volume for use as a database.
      sudo mkfs.ocfs2 -L "dbvol" -T datafiles /dev/sdd2

      In this case, the cluster size is set to 128 KB and the journal size to 32 MB.

    • Create a volume with specific property settings.
      sudo mkfs.ocfs2 -C 16K -J size=128M -N 16 --fs-feature-level=max-features --fs-features=norefcount /dev/sde1

      In this case, the cluster and journal sizes are specified, as well as the number of node slots. Likewise, all of the supported features are enabled, except refcount trees, which the norefcount setting explicitly disables.

  2. On each cluster member, mount the created volume.
    1. Create a mount point.
      sudo mkdir /u01
    2. Mount the volume.
      sudo mount -L myvol /u01
    3. Check the status of the heartbeat mode.
      sudo o2cb.init status

      The heartbeat becomes active after the volume is mounted.

Optionally, you can automate the mount operation by adding an entry to the /etc/fstab file, for example:

LABEL=myvol  /u01   ocfs2     _netdev,defaults  0 0

In the entry, _netdev informs the system to mount an OCFS2 volume at boot time only after networking is started. Likewise, the system should unmount the file system before networking is stopped.
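
After adding the entry, you can verify it without rebooting, for example:

# Mount any /etc/fstab entries that are not already mounted
sudo mount -a

# Confirm that the OCFS2 volume is mounted
mount | grep ocfs2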

Querying and Changing Volume Parameters

Use the tunefs.ocfs2 command to query or change volume parameters.

For example, to find out the label, UUID, and number of node slots for a volume, you would use the following command:

sudo tunefs.ocfs2 -Q "Label = %V\nUUID = %U\nNumSlots =%N\n" /dev/sdb
Label = myvol
UUID = CBB8D5E0C169497C8B52A0FD555C7A3E
NumSlots = 4

You would generate a new UUID for a volume by using the following commands:

sudo tunefs.ocfs2 -U /dev/sdb
sudo tunefs.ocfs2 -Q "Label = %V\nUUID = %U\nNumSlots =%N\n" /dev/sdb
Label = myvol
UUID = 48E56A2BBAB34A9EB1BE832B3C36AB5C
NumSlots = 4

Creating a Local OCFS2 File System

The following procedure describes how to create an OCFS2 file system to be mounted locally, which is not associated with a cluster.

To create an OCFS2 file system that is to be mounted locally, use the following command syntax:

sudo mkfs.ocfs2 -M local --fs-features=local -N 1 [options] device

For example, you can create a locally mountable OCFS2 volume on /dev/sdc1 as follows:

sudo mkfs.ocfs2 -M local --fs-features=local -N 1 -L "localvol" /dev/sdc1

The command creates a locally mountable volume with the label localvol and a single node slot.
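
You can then mount the volume in the usual way; the mount point in this example is arbitrary:

sudo mkdir -p /srv/localvol
sudo mount -L localvol /srv/localvol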

To convert a local OCFS2 file system to cluster use, use the tunefs.ocfs2 utility as follows:

sudo umount /dev/sdc1
sudo tunefs.ocfs2 -M cluster --fs-features=clusterinfo -N 8 /dev/sdc1

The previous example also increases the number of node slots from 1 to 8, to allow up to eight nodes to mount the file system.

Troubleshooting OCFS2 Issues

Refer to the following information when investigating how to resolve issues that you might encounter when administering OCFS2.

Recommended Debugging Tools and Practices

Use the following tools to troubleshoot OCFS2 issues:

  • Set up netconsole on the nodes to capture an oops trace.

  • Use the tcpdump command to capture the DLM's network traffic between nodes. For example, to capture TCP traffic on port 7777 for the private network interface em2, you could use the following command:

    sudo tcpdump -i em2 -C 10 -W 15 -s 10000 -Sw /tmp/`hostname -s`_tcpdump.log \
    -ttt 'port 7777' &
  • Use the debugfs.ocfs2 command to trace events in the OCFS2 driver, determine lock statuses, walk directory structures, examine inodes, and so on. This command is similar in behavior to the debugfs command that is used for the ext3 file system.

    For more information, see the debugfs.ocfs2(8) manual page.

  • Use the o2image command to save an OCFS2 file system's metadata, including information about inodes, file names, and directory names, to an image file on another file system. Because the image file contains only metadata, it is much smaller than the original file system. You can use the debugfs.ocfs2 command to open the image file and analyze the file system layout to determine the cause of a file system corruption or performance problem.

    For example, to create the image /tmp/sda2.img from the OCFS2 file system on the device /dev/sda2, you would use the following command:
    sudo o2image /dev/sda2 /tmp/sda2.img

    For more information, see the o2image(8) manual page.

Mounting the debugfs File System

OCFS2 uses the debugfs file system to enable userspace access to information about its in-kernel state. You must mount the debugfs file system to use the debugfs.ocfs2 command.

For example, to mount the debugfs file system, add the following line to the /etc/fstab file:

debugfs    /sys/kernel/debug      debugfs  defaults  0 0

Then, run the mount -a command.
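
Alternatively, you can mount the debugfs file system manually for the current boot session only:

sudo mount -t debugfs debugfs /sys/kernel/debug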

Configuring OCFS2 Tracing

You can use the following commands and methods to trace issues in OCFS2.

Commands for Tracing OCFS2 Issues

The following list describes several commands that are useful for tracing issues.

debugfs.ocfs2 -l

Lists all of the trace bits and their statuses.

debugfs.ocfs2 -l SUPER allow|off|deny

Enables, disables, or disallows tracing for the superblock, respectively. If you specify deny, then even if another tracing mode setting implicitly allows it, tracing is still disallowed.

debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow

Enable heartbeat tracing.

debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny

Disable heartbeat tracing. ENTRY and EXIT parameters are set to deny, as these parameters exist in all trace paths.

debugfs.ocfs2 -l ENTRY EXIT NAMEI INODE allow

Enable tracing for the file system.

debugfs.ocfs2 -l ENTRY EXIT deny NAMEI INODE allow

Disable tracing for the file system.

debugfs.ocfs2 -l ENTRY EXIT DLM DLM_THREAD allow

Enable tracing for the DLM.

debugfs.ocfs2 -l ENTRY EXIT deny DLM DLM_THREAD allow

Disable tracing for the DLM.

OCFS2 Tracing Methods and Examples

One method that you can use to obtain a trace is to first enable the trace, sleep for a short while, and then disable the trace. As shown in the following example, to avoid unnecessary output, you should reset the trace bits to their default settings after you have finished tracing:

sudo debugfs.ocfs2 -l ENTRY EXIT NAMEI INODE allow && sleep 10 &&
sudo debugfs.ocfs2 -l ENTRY EXIT deny NAMEI INODE off 

To limit the amount of information that is displayed, enable only the trace bits that are relevant to diagnosing the problem.

If a specific file system command, such as mv, is causing an error, you might use commands that are shown in the following example to trace the error:

sudo debugfs.ocfs2 -l ENTRY EXIT NAMEI INODE allow
mv source destination & CMD_PID=$(jobs -p %-)
echo $CMD_PID
sudo debugfs.ocfs2 -l ENTRY EXIT deny NAMEI INODE off 

Because the trace is enabled for all mounted OCFS2 volumes, knowing the correct process ID can help you to interpret the trace.

For more information, see the debugfs.ocfs2(8) manual page.

Debugging File System Locks

If an OCFS2 volume hangs, you can use the following procedure to determine which locks are busy and which processes are likely to be holding the locks.

In the following procedure, the Lockres value refers to the lock name that is used by DLM, which is a combination of a lock-type identifier, inode number, and a generation number. The following table lists the various lock types and their associated identifier.

Table 3-1 DLM Lock Types

Identifier    Lock Type
D             File data
M             Metadata
R             Rename
S             Superblock
W             Read-write

  1. Mount the debug file system.

    sudo mount -t debugfs debugfs /sys/kernel/debug
  2. Dump the lock statuses for the file system device, which is /dev/sdx1 in the following example:

    echo "fs_locks" | sudo debugfs.ocfs2 /dev/sdx1 | sudo tee /tmp/fslocks
    Lockres: M00000000000006672078b84822 Mode: Protected Read
    ...
  3. Use the Lockres value from the previous output to obtain the inode number and generation number for the lock.

    sudo echo "stat lockres-value" | sudo debugfs.ocfs2 -n /dev/sdx1

    For example, for the Lockres value M00000000000006672078b84822 from the previous step, the command output might resemble the following:

    Inode: 419616   Mode: 0666   Generation: 2025343010 (0x78b84822)
    ... 
  4. Determine the file system object to which the inode number in the previous output relates.

    sudo echo "locate inode" | sudo debugfs.ocfs2 -n /dev/sdx1

    For example, for the Inode value 419616 from the previous step, the command output might resemble the following:

    419616 /linux-2.6.15/arch/i386/kernel/semaphore.c
  5. Obtain the lock names that are associated with the file system object, which in the previous step's output is /linux-2.6.15/arch/i386/kernel/semaphore.c. Thus, you would type:

    sudo echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c" | sudo debugfs.ocfs2 -n /dev/sdx1
    M00000000000006672078b84822 D00000000000006672078b84822 W00000000000006672078b84822  

    In the previous example, a metadata lock, a file data lock, and a read-write lock are associated with the file system object.

  6. Determine the DLM domain of the file system by running the following command:

    sudo echo "stats" | sudo debugfs.ocfs2 -n /dev/sdX1 | grep UUID: | while read a b ; do echo $b ; done
    82DA8137A49A47E4B187F74E09FBBB4B  
  7. Using the values of the DLM domain and the lock name, enable debugging for the DLM by running the following command:

    sudo echo R 82DA8137A49A47E4B187F74E09FBBB4B M00000000000006672078b84822 | sudo tee /proc/fs/ocfs2_dlm/debug
  8. Examine the debug messages by using the dmesg | tail command, for example:

    struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=3, key=965960985
      lockres: M00000000000006672078b84822, owner=1, state=0 last used: 0, 
      on purge list: no granted queue:
          type=3, conv=-1, node=3, cookie=11673330234144325711, ast=(empty=y,pend=n), 
          bast=(empty=y,pend=n) 
        converting queue:
        blocked queue:  

    The DLM supports three lock modes: no lock (type=0), protected read (type=3), and exclusive (type=5). In the previous example, the lock is owned by node 1 (owner=1) and node 3 has been granted a protected-read lock on the file-system resource.

  9. Use the following command to search for processes that are in an uninterruptible sleep state, which are indicated by the D flag in the STAT column:

    ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

    Note that at least one of the processes that are in the uninterruptible sleep state is responsible for the hang on the other node.

If a process is waiting for I/O to complete, the problem could be anywhere in the I/O subsystem, from the block device layer through the drivers, to the disk array. If the hang concerns a user lock (flock()), the problem could lie with the application. If possible, kill the holder of the lock. If the hang is due to a lack of memory or fragmented memory, you can free up memory by killing non-essential processes. The most immediate solution is to reset the node that is holding the lock. The DLM recovery process can then clear all of the locks that the dead node owned, thus enabling the cluster to continue to operate.

Configuring the Behavior of Fenced Nodes With Kdump

If the heartbeat mechanism detects that a node with a mounted OCFS2 volume has lost contact with the other cluster nodes, that node is removed from the cluster in a process called fencing. Fencing prevents other nodes from hanging while attempting to access resources that are held by the fenced node. By default, a fenced node automatically restarts so that it can quickly rejoin the cluster.

However, under some circumstances, this default behavior might not be desirable. For example, if a node frequently restarts for no apparent reason, you might prefer that the node panic instead of restarting so that you can troubleshoot the issue. By enabling Kdump on the node, you can obtain a vmcore crash dump from the fenced node, which you can analyze to diagnose the cause of the frequent restarts.

To configure a node to panic at the next fencing, set fence_method to panic by running the following command on the node after the cluster starts:

echo "panic" | sudo tee /sys/kernel/config/cluster/cluster-name/fence_method

To set the value after each system reboot, add the same line to the /etc/rc.local file.
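
For example, assuming the cluster name mycluster from the earlier examples, the entry in /etc/rc.local would be as follows. Note that on Oracle Linux 8 the /etc/rc.d/rc.local file must also be made executable for the rc-local service to run it at boot.

echo "panic" | sudo tee /sys/kernel/config/cluster/mycluster/fence_method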

To restore the default behavior, change the value of fence_method back to reset.

echo "reset" | sudo tee /sys/kernel/config/cluster/cluster-name/fence_method

Likewise, remove the panic line from /etc/rc.local if the line exists in the file.