Chapter 3 Managing the Oracle Cluster File System Version 2 in Oracle Linux

This chapter includes information about managing the Oracle Cluster File System Version 2 (OCFS2) in Oracle Linux 8. The chapter includes tasks for configuring, administering, and troubleshooting OCFS2.

Note

In Oracle Linux 8, the OCFS2 file system type is supported on Unbreakable Enterprise Kernel (UEK) releases only, starting with Unbreakable Enterprise Kernel Release 6 (UEK R6).

For information about local file system management in Oracle Linux, see Oracle® Linux 8: Managing Local File Systems.

3.1 About OCFS2

OCFS2 (Oracle Cluster File System Version 2) is a general-purpose shared-disk file system that is intended for use with clusters. OCFS2 offers high performance, as well as high availability. It is also possible to mount an OCFS2 volume on a standalone, non-clustered system.

Although it might appear that the ability to mount an OCFS2 file system locally offers no benefits over alternative file systems such as Ext4 or Btrfs, you can use the reflink command with OCFS2 to create copy-on-write clones of individual files. You can also use the cp --reflink command in the same way that you would on a Btrfs file system. Typically, such clones enable you to save disk space when storing multiple copies of very similar files, such as virtual machine (VM) images or Linux Containers. In addition, mounting a local OCFS2 file system enables you to subsequently migrate it to a cluster file system without requiring any conversion. Note that when using the reflink command, the resulting file system behaves like a clone of the original file system, which means that their UUIDs are identical. When using the reflink command to create a clone, you must change the UUID by using the tunefs.ocfs2 command. See Section 3.4.5, “Querying and Changing Volume Parameters”.
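
For example, the following minimal sketch clones a VM image file by using both methods (the /ocfs2/images paths are hypothetical; substitute files on your own OCFS2 mount point):

# reflink /ocfs2/images/ol8-base.img /ocfs2/images/ol8-dev.img
# cp --reflink /ocfs2/images/ol8-base.img /ocfs2/images/ol8-test.img

Both clones initially share data blocks with the original file, so they consume additional space only as the files diverge.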

Almost all applications can use OCFS2 because it provides local file-system semantics. Applications that are cluster-aware can use cache-coherent parallel I/O from multiple cluster nodes to balance activity across the cluster, or they can use the available file-system functionality to fail over and run on another node in the event that a node fails.

The following are examples of some typical use cases for OCFS2:

  • Oracle VM to host shared access to virtual machine images.

  • Oracle VM and VirtualBox to enable Linux guest machines to share a file system.

  • Oracle Real Application Cluster (RAC) in database clusters.

  • Oracle E-Business Suite in middleware clusters.

The following OCFS2 features make it a suitable choice for deployment in an enterprise-level computing environment:

  • Support for ordered and write-back data journaling that provides file system consistency in the event of power failure or system crash.

  • Block sizes ranging from 512 bytes to 4 KB, and file-system cluster sizes ranging from 4 KB to 1 MB (both in increments of powers of 2). The maximum supported volume size is 16 TB, which corresponds to a cluster size of 4 KB. A volume size as large as 4 PB is theoretically possible for a cluster size of 1 MB, although this limit has not been tested.

  • Inclusion of nowait support for OCFS2 in Unbreakable Enterprise Kernel Release 6 (UEK R6).

    When the nowait flag is specified, direct I/O returns -EAGAIN instead of blocking if the required locks cannot be acquired immediately, or if blocks are not allocated at the write location, because allocating those blocks would subsequently block the I/O operation.

  • Extent-based allocations for efficient storage of very large files.

  • Optimized allocation support for sparse files, inline-data, unwritten extents, hole punching, reflinks, and allocation reservation for high performance and efficient storage.

  • Indexing of directories to allow efficient access to a directory even if it contains millions of objects.

  • Metadata checksums for the detection of corrupted inodes and directories.

  • Extended attributes to allow an unlimited number of name:value pairs to be attached to file system objects such as regular files, directories, and symbolic links.

  • Advanced security support for POSIX ACLs and SELinux in addition to the traditional file-access permission model.

  • Support for user and group quotas.

  • Support for heterogeneous clusters of nodes with a mixture of 32-bit and 64-bit, little-endian (x86, x86_64, ia64), and big-endian (ppc64) architectures.

  • An easy-to-configure, in-kernel cluster-stack (O2CB) with the Linux Distributed Lock Manager (DLM) for managing concurrent access from the cluster nodes.

  • Support for buffered, direct, asynchronous, splice, and memory-mapped I/O.

  • A toolset that uses parameters similar to those of the ext3 file system.

For more information about OCFS2, visit https://oss.oracle.com/projects/ocfs2/documentation/.

3.2 Maximum File System Size Requirements for OCFS2

Starting with the Oracle Linux 8.2 release, the OCFS2 file system is supported on systems that are running the Unbreakable Enterprise Kernel Release 6 (UEK R6) kernel.

The maximum file size and maximum file system size requirements for OCFS2 are as follows:

  • Maximum file size:

    4 PiB

  • Maximum file system size:

    4 PiB

3.3 Installing and Configuring an OCFS2 Cluster

The following procedures describe how to set up a cluster to use OCFS2.

3.3.1 Preparing a Cluster for OCFS2

For the best performance, each node in the cluster should have at least two network interfaces. The first interface is connected to a public network to allow general access to the systems, while the second interface is used for private communications between the nodes and the cluster heartbeat. This second interface determines how the cluster nodes coordinate their access to shared resources and how they monitor each other's state.

Note

Both network interfaces must be connected through a network switch. Additionally, you must ensure that all of the network interfaces are configured and working before configuring the cluster.

You can choose from the following two cluster heartbeat configurations:

  • Local heartbeat thread for each shared device (default heartbeat mode).

    In this mode, a node starts a heartbeat thread when it mounts an OCFS2 volume and stops the thread when it unmounts the volume. There is a large CPU overhead on nodes that mount a large number of OCFS2 volumes as each mount requires a separate heartbeat thread. Note that a large number of mounts also increases the risk of a node fencing itself out of the cluster due to a heartbeat I/O timeout on a single mount.

  • Global heartbeat on specific shared devices.

    This mode enables you to configure any OCFS2 volume as a global heartbeat device, provided that it occupies a whole disk device and not a partition. In this mode, the heartbeat to the device starts when the cluster comes online and stops when the cluster goes offline. This mode is recommended for clusters that mount a large number of OCFS2 volumes. A node fences itself out of the cluster if a heartbeat I/O timeout occurs on more than half of the global heartbeat devices. To provide redundancy against failure of one of the devices, you should configure at least three global heartbeat devices.

The following figure shows a cluster of four nodes that are connected by using a network switch to a LAN and a network storage server. The nodes and storage server are also connected by using a switch to a private network that is used for the local cluster heartbeat.

Figure 3.1 Cluster Configuration by Using a Private Network
The figure shows a cluster of four nodes that are connected by using a network switch to a LAN and a network storage server. The nodes and the storage server are also connected by using a switch to a private network, which is used for the cluster heartbeat.


Although it is possible to configure and use OCFS2 without using a private network, note that such a configuration increases the probability of a node fencing itself out of the cluster due to an I/O heartbeat timeout.

3.3.2 Configuring the Firewall for the Cluster

Configure or disable the firewall on each node to allow access on the interface that the cluster will use for private cluster communication. By default, the cluster uses both TCP and UDP over port 7777.

For example, to allow incoming TCP connections and UDP datagrams on port 7777, you would use the following commands:

# firewall-cmd --zone=zone --add-port=7777/tcp --add-port=7777/udp
# firewall-cmd --permanent --zone=zone --add-port=7777/tcp --add-port=7777/udp
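
If you are unsure which zone applies to the private interface, you can list the active zones and their interfaces first, substitute the appropriate zone name in the previous commands, and then reload the firewall if you changed only the permanent configuration. This is a sketch; zone names and interface assignments vary between systems:

# firewall-cmd --get-active-zones
# firewall-cmd --reload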

3.3.3 Configuring the Cluster Software

Ideally, each node should be running the same version of OCFS2 software and a compatible UEK release. It is possible for a cluster to run with mixed versions of the OCFS2 and UEK software; for example, while you are performing a rolling update of a cluster. The cluster node that is running the lowest version of the software determines the set of usable features.

Use the dnf command to install or upgrade the following packages to the same version on each node:

  • kernel-uek

  • ocfs2-tools

Note

If you want to use the global heartbeat feature, you need to install the ocfs2-tools-1.8.0-11 or later package.
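
For example, you might install both packages on each node as follows (a sketch; reboot into the new UEK kernel if the kernel-uek package was updated):

# dnf install kernel-uek ocfs2-tools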

3.3.4 Creating the Configuration File for the Cluster Stack

You create the configuration file by using the o2cb command or by using a text editor.

To configure the cluster stack by using the o2cb command:

  1. Create a cluster definition by using the following command:

    # o2cb add-cluster cluster_name 

    For example, you would define a cluster named mycluster with four nodes as follows:

    # o2cb add-cluster mycluster

    The previous command creates the /etc/ocfs2/cluster.conf configuration file, if it does not already exist.

  2. For each node, define the node as follows:

    # o2cb add-node cluster_name node_name --ip ip_address

    The name of the node must be the same as the value of the system's HOSTNAME that is configured in the /etc/sysconfig/network file. The IP address is the one that the node will use for private communication in the cluster.

    For example, you would use the following command to define a node named node0, with the IP address 10.1.0.100, in the cluster mycluster:

    # o2cb add-node mycluster node0 --ip 10.1.0.100
    Note

    OCFS2 only supports IPv4 addresses.

  3. If you want the cluster to use global heartbeat devices, run the following commands:

    # o2cb add-heartbeat cluster_name device1
    .
    .
    .
    # o2cb heartbeat-mode cluster_name global
    Important

    You must configure the global heartbeat feature to use whole disk devices. You cannot configure a global heartbeat device on a disk partition.

    For example, you would use /dev/sdd, /dev/sdg, and /dev/sdj as global heartbeat devices by typing the following commands:

    # o2cb add-heartbeat mycluster /dev/sdd
    # o2cb add-heartbeat mycluster /dev/sdg
    # o2cb add-heartbeat mycluster /dev/sdj
    # o2cb heartbeat-mode mycluster global
  4. Copy the cluster configuration file /etc/ocfs2/cluster.conf to each node in the cluster.

  5. Restart the cluster stack for the changes you made to the cluster configuration file to take effect.
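
    For example, the following sketch covers steps 4 and 5, assuming the remaining nodes are named node1, node2, and node3 (hypothetical names) and are reachable over SSH; run the restart commands on each node:

    # for node in node1 node2 node3; do scp /etc/ocfs2/cluster.conf ${node}:/etc/ocfs2/; done
    # /sbin/o2cb.init offline
    # /sbin/o2cb.init online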

The following is a typical example of the /etc/ocfs2/cluster.conf file. This particular configuration defines a 4-node cluster named mycluster, with a local heartbeat:

node:
	name = node0
	cluster = mycluster
	number = 0
	ip_address = 10.1.0.100
	ip_port = 7777

node:
        name = node1
        cluster = mycluster
        number = 1
        ip_address = 10.1.0.101
        ip_port = 7777

node:
        name = node2
        cluster = mycluster
        number = 2
        ip_address = 10.1.0.102
        ip_port = 7777

node:
        name = node3
        cluster = mycluster
        number = 3
        ip_address = 10.1.0.103
        ip_port = 7777

cluster:
        name = mycluster
        heartbeat_mode = local
        node_count = 4

If you configure your cluster to use a global heartbeat, the file also includes entries for the global heartbeat devices, as shown in the following example:

node:
        name = node0
        cluster = mycluster
        number = 0
        ip_address = 10.1.0.100
        ip_port = 7777

node:
        name = node1
        cluster = mycluster
        number = 1
        ip_address = 10.1.0.101
        ip_port = 7777

node:
        name = node2
        cluster = mycluster
        number = 2
        ip_address = 10.1.0.102
        ip_port = 7777

node:
        name = node3
        cluster = mycluster
        number = 3
        ip_address = 10.1.0.103
        ip_port = 7777

cluster:
        name = mycluster
        heartbeat_mode = global
        node_count = 4

heartbeat:
        cluster = mycluster
        region = 7DA5015346C245E6A41AA85E2E7EA3CF

heartbeat:
        cluster = mycluster
        region = 4F9FBB0D9B6341729F21A8891B9A05BD

heartbeat:
        cluster = mycluster
        region = B423C7EEE9FC426790FC411972C91CC3

The cluster heartbeat mode is now shown as global and the heartbeat regions are represented by the UUIDs of their block devices.

If you edit the configuration file manually, ensure that you use the following layout:

  • The cluster:, heartbeat:, and node: headings must start in the first column.

  • Each parameter entry must be indented by one tab space.

  • A blank line must separate each section that defines the cluster, a heartbeat device, or a node.

3.3.5 Configuring the Cluster Stack

When configuring a cluster stack, you are prompted for several values. Refer to the following table for the values that you need to provide.

Prompt

Description

Load O2CB driver on boot (y/n)

Specify whether the cluster stack driver should be loaded at boot time. The default response is n.

Cluster stack backing O2CB

Name of the cluster stack service. The default and usual response is o2cb.

Cluster to start at boot (Enter "none" to clear)

Enter the name of your cluster that you defined in the cluster configuration file, /etc/ocfs2/cluster.conf.

Specify heartbeat dead threshold (>=7)

Number of 2-second heartbeats that must elapse without response before a node is considered dead. To calculate the value to enter, divide the required threshold time period by 2 and then add 1. For example, to set the threshold time period to 120 seconds, enter a value of 61. The default value is 31, which corresponds to a threshold time period of 60 seconds.

Note

If your system uses multipathed storage, the recommended value is 61 or greater.

Specify network idle timeout in ms (>=5000)

Time in milliseconds that must elapse before a network connection is considered dead. The default value is 30,000 milliseconds.

Note

For bonded network interfaces, the recommended value is 30,000 milliseconds or greater.

Specify network keepalive delay in ms (>=1000)

Maximum delay in milliseconds between sending keepalive packets to another node. The default and recommended value is 2,000 milliseconds.

Specify network reconnect delay in ms (>=2000)

Minimum delay in milliseconds between reconnection attempts if a network connection goes down. The default and recommended value is 2,000 milliseconds.

Follow these steps to configure the cluster stack:

  1. On each node of the cluster, run the following command:

    # /sbin/o2cb.init configure

    Verify the settings for the cluster stack.

    # /sbin/o2cb.init status
    Driver for "configfs": Loaded
    Filesystem "configfs": Mounted
    Stack glue driver: Loaded
    Stack plugin "o2cb": Loaded
    Driver for "ocfs2_dlmfs": Loaded
    Filesystem "ocfs2_dlmfs": Mounted
    Checking O2CB cluster "mycluster": Online
      Heartbeat dead threshold: 61
      Network idle timeout: 30000
      Network keepalive delay: 2000
      Network reconnect delay: 2000
      Heartbeat mode: Local
    Checking O2CB heartbeat: Active

    In the previous example, the cluster is online and is using the local heartbeat mode. If no volumes have been configured, the O2CB heartbeat is shown as Not active, rather than Active.

    The following example shows the command output for an online cluster that is using three global heartbeat devices:

    # /sbin/o2cb.init status
    Driver for "configfs": Loaded
    Filesystem "configfs": Mounted
    Stack glue driver: Loaded
    Stack plugin "o2cb": Loaded
    Driver for "ocfs2_dlmfs": Loaded
    Filesystem "ocfs2_dlmfs": Mounted
    Checking O2CB cluster "mycluster": Online
      Heartbeat dead threshold: 61
      Network idle timeout: 30000
      Network keepalive delay: 2000
      Network reconnect delay: 2000
      Heartbeat mode: Global
    Checking O2CB heartbeat: Active
      7DA5015346C245E6A41AA85E2E7EA3CF /dev/sdd
      4F9FBB0D9B6341729F21A8891B9A05BD /dev/sdg
      B423C7EEE9FC426790FC411972C91CC3 /dev/sdj
  2. Configure the o2cb and ocfs2 services so that they start at boot time after networking is enabled.

    # systemctl enable o2cb
    # systemctl enable ocfs2

    These settings enable the node to mount OCFS2 volumes automatically when the system starts.
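
    To verify that both services are enabled, you might run the following command, which should report enabled for each service:

    # systemctl is-enabled o2cb ocfs2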

3.3.6 Configuring the Kernel for Cluster Operation

To ensure the correct operation of a cluster, you must configure the required kernel settings, as described in the following table.

Kernel Setting

Description

panic

Specifies the number of seconds after a panic before a system automatically resets itself.

If the value is 0, the system hangs, which allows you to collect detailed information about the panic for troubleshooting. This is the default value.

To enable automatic reset, set a non-zero value. If you require a memory image (vmcore), allow enough time for Kdump to create this image. The suggested value is 30 seconds, although large systems will require a longer time.

panic_on_oops

Specifies that a system must panic if a kernel oops occurs. If a kernel thread required for cluster operation crashes, the system must reset itself. Otherwise, another node might not be able to tell whether a node is slow to respond or unable to respond, causing cluster operations to hang.

  1. On each node, set the recommended values for panic and panic_on_oops, for example:

    # sysctl kernel.panic=30
    # sysctl kernel.panic_on_oops=1

  2. To make the change persist across reboots, add the following entries to the /etc/sysctl.conf file:

    # Define panic and panic_on_oops for cluster operation
    kernel.panic=30
    kernel.panic_on_oops=1
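
    To apply the persisted settings without rebooting and confirm the values that are now in effect, you might run the following commands (a sketch; sysctl -p also reports any other settings defined in the file):

    # sysctl -p
    # sysctl kernel.panic kernel.panic_on_oops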

3.3.7 Commands for Administering the Cluster Stack

There are several commands that you can use to administer the cluster stack. The following table describes the commands for performing various operations on the cluster stack.

Command                    Description
/sbin/o2cb.init status     Check the status of the cluster stack.
/sbin/o2cb.init online     Start the cluster stack.
/sbin/o2cb.init offline    Stop the cluster stack.
/sbin/o2cb.init unload     Unload the cluster stack.

3.4 Administering OCFS2 Volumes

The following tasks describe how to administer OCFS2 volumes.

3.4.1 Commands for Creating OCFS2 Volumes

You use the mkfs.ocfs2 command to create an OCFS2 volume on a device. If you want to label the volume and mount it by specifying the label, the device must correspond to a partition. You cannot mount an unpartitioned disk device by specifying a label.

The following table describes some useful options that you can use when creating an OCFS2 volume.

Command Option

Description

-b block-size

--block-size block-size

Specifies the unit size for I/O transactions to and from the file system, and the size of inode and extent blocks. The supported block sizes are 512 (512 bytes), 1K, 2K, and 4K. The default and recommended block size is 4K (4 kilobytes).

-C cluster-size

--cluster-size cluster-size

Specifies the unit size for space used to allocate file data. The supported cluster sizes are 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, and 1M (1 megabyte). The default cluster size is 4K (4 kilobytes).

--fs-feature-level=feature-level

Enables you to select a set of file-system features:

default

Enables support for the sparse files, unwritten extents, and inline data features.

max-compat

Enables only those features that are understood by older versions of OCFS2.

max-features

Enables all features that OCFS2 currently supports.

--fs_features=feature

Allows you to enable or disable individual features, such as support for sparse files, unwritten extents, and backup superblocks. For more information, see the mkfs.ocfs2(8) manual page.

-J size=journal-size

--journal-options size=journal-size

Specifies the size of the write-ahead journal. If you do not specify a size, it is determined from the file system usage type that you specify with the -T option or, otherwise, from the volume size. The default journal size is 64M (64 MB) for datafiles, 256M (256 MB) for mail, and 128M (128 MB) for vmstore.

-L volume-label

--label volume-label

Specifies a descriptive name for the volume that allows you to identify it easily on different cluster nodes.

-N number

--node-slots number

Determines the maximum number of nodes that can concurrently access a volume, which is limited by the number of node slots for system files such as the file-system journal. For best performance, set the number of node slots to at least twice the number of nodes. If you subsequently increase the number of node slots, performance can suffer because the journal will no longer be contiguously laid out on the outer edge of the disk platter.

-T file-system-usage-type

Specifies the type of usage for the file system:

datafiles

Database files are typically few in number, fully allocated, and relatively large. Such files require few metadata changes, and do not benefit from having a large journal.

mail

Mail server files are typically many in number, and relatively small. Such files require many metadata changes, and benefit from having a large journal.

vmstore

Virtual machine image files are typically few in number, sparsely allocated, and relatively large. Such files require a moderate number of metadata changes and a medium sized journal.
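
In relation to the -N option described in the previous table, you can add node slots to an existing volume later by using the tunefs.ocfs2 command, although the performance caveat mentioned in the table applies. The following is a minimal sketch, assuming a hypothetical volume on /dev/sdc1 that is not mounted on any node:

# tunefs.ocfs2 -N 8 /dev/sdc1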

3.4.2 Suggested Cluster Size Settings

The following table provides suggested minimum cluster size settings for different file system size ranges.

File System Size    Suggested Minimum Cluster Size
1 GB - 10 GB        8K
10 GB - 100 GB      16K
100 GB - 1 TB       32K
1 TB - 10 TB        64K
10 TB - 16 TB       128K

3.4.3 Creating OCFS2 Volumes

When creating OCFS2 volumes, keep the following additional points in mind:

  • Do not create an OCFS2 volume on an LVM logical volume, as LVM is not cluster-aware.

  • You cannot change the block and cluster size of an OCFS2 volume after you have created it. You can use the tunefs.ocfs2 command to modify other settings for the file system, with certain restrictions. For more information, see the tunefs.ocfs2(8) manual page.

  • If you intend to store database files on the volume, do not specify a cluster size that is smaller than the block size of the database.

  • The default cluster size of 4 KB is not suitable if the file system is larger than a few gigabytes.

The following examples show some of the ways in which you can create an OCFS2 volume.

Create an OCFS2 volume on /dev/sdc1 labeled myvol using all of the default settings for generic usage on file systems that are no larger than a few gigabytes. The default values are a 4 KB block and cluster size, eight node slots, a 256 MB journal, and support for default file-system features:

# mkfs.ocfs2 -L "myvol" /dev/sdc1

Create an OCFS2 volume on /dev/sdd2 labeled as dbvol for use with database files. In this case, the cluster size is set to 128 KB and the journal size to 32 MB.

# mkfs.ocfs2 -L "dbvol" -T datafiles /dev/sdd2

Create an OCFS2 volume on /dev/sde1, with a 16 KB cluster size, a 128 MB journal, 16 node slots, and support enabled for all features except refcount trees.

# mkfs.ocfs2 -C 16K -J size=128M -N 16 --fs-feature-level=max-features \
  --fs-features=norefcount /dev/sde1

3.4.4 Mounting OCFS2 Volumes

Specify the _netdev option in the /etc/fstab file if you want the system to mount an OCFS2 volume at boot time after networking is started and unmount the file system before networking is stopped, as shown in the following example:

myocfs2vol  /dbvol1  ocfs2     _netdev,defaults  0 0
Note

For the file system to mount, you must enable the o2cb and ocfs2 services to start after networking is started. See Section 3.3.5, “Configuring the Cluster Stack”.
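
You can also mount an OCFS2 volume manually after the cluster stack is online. The following is a minimal sketch that mounts a volume by its device name (the device and mount point are placeholders; substitute your own values):

# mkdir -p /dbvol1
# mount -t ocfs2 /dev/sdc1 /dbvol1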

3.4.5 Querying and Changing Volume Parameters

Use the tunefs.ocfs2 command to query or change volume parameters.

For example, to find out the label, UUID, and number of node slots for a volume, you would use the following command:

# tunefs.ocfs2 -Q "Label = %V\nUUID = %U\nNumSlots =%N\n" /dev/sdb
Label = myvol
UUID = CBB8D5E0C169497C8B52A0FD555C7A3E
NumSlots = 4

You would generate a new UUID for a volume by using the following command:

# tunefs.ocfs2 -U /dev/sdb
# tunefs.ocfs2 -Q "Label = %V\nUUID = %U\nNumSlots =%N\n" /dev/sdb
Label = myvol
UUID = 48E56A2BBAB34A9EB1BE832B3C36AB5C
NumSlots = 4
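
Similarly, you can change the volume label. The following sketch relabels the same volume and confirms the change:

# tunefs.ocfs2 -L "newvol" /dev/sdb
# tunefs.ocfs2 -Q "Label = %V\n" /dev/sdb
Label = newvol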

3.5 Creating a Local OCFS2 File System

Note

The OCFS2 file system type is supported on the Unbreakable Enterprise Kernel (UEK) release only.

The following procedure describes how to create an OCFS2 file system to be mounted locally, which is not associated with a cluster.

To create an OCFS2 file system that is to be mounted locally, use the following command syntax:

# mkfs.ocfs2 -M local --fs-features=local -N 1 [options] device

For example, you would create a locally mountable OCFS2 volume on /dev/sdc1, with one node slot and the label localvol, as follows:

# mkfs.ocfs2 -M local --fs-features=local -N 1 -L "localvol" /dev/sdc1

You can use the tunefs.ocfs2 utility to convert a local OCFS2 file system to cluster use, as follows:

# umount /dev/sdc1
# tunefs.ocfs2 -M cluster --fs-features=cluster -N 8 /dev/sdc1

The previous example also increases the number of node slots from 1 to 8, to allow up to eight nodes to mount the file system.

3.6 Troubleshooting OCFS2 Issues

Refer to the following information when investigating how to resolve issues that you might encounter when administering OCFS2.

3.6.1 Recommended Debugging Tools and Practices

You can use the following tools to troubleshoot OCFS2 issues:

  • It is recommended that you set up netconsole on the nodes to capture an oops trace.

  • You can use the tcpdump command to capture the DLM's network traffic between nodes. For example, to capture TCP traffic on port 7777 for the private network interface em2, you could use the following command:

    # tcpdump -i em2 -C 10 -W 15 -s 10000 -Sw /tmp/`hostname -s`_tcpdump.log \
      -ttt 'port 7777' &
  • You can use the debugfs.ocfs2 command to trace events in the OCFS2 driver, determine lock statuses, walk directory structures, examine inodes, and so on. This command is similar in behavior to the debugfs command that is used for the ext3 file system.

    For more information, see the debugfs.ocfs2(8) manual page.

  • Use the o2image command to save an OCFS2 file system's metadata, including information about inodes, file names, and directory names, to an image file on another file system. Because the image file contains only metadata, it is much smaller than the original file system. You can use the debugfs.ocfs2 command to open the image file and analyze the file system layout to determine the cause of a file system corruption or performance problem.

    For example, to create the image /tmp/sda2.img from the OCFS2 file system on the device /dev/sda2, you would use the following command:

    # o2image /dev/sda2 /tmp/sda2.img

    For more information, see the o2image(8) manual page.

3.6.2 Mounting the debugfs File System

OCFS2 uses the debugfs file system to enable userspace access to information about its in-kernel state. Note that you must mount the debugfs file system to use the debugfs.ocfs2 command.

For example, to mount the debugfs file system, add the following line to the /etc/fstab file:

debugfs    /sys/kernel/debug      debugfs  defaults  0 0

Then, run the mount -a command.

3.6.3 Configuring OCFS2 Tracing

You can use the following commands and methods to trace issues in OCFS2.

3.6.3.1 Commands for Tracing OCFS2 Issues

The following table describes several commands that are useful for tracing issues.

Command

Description

debugfs.ocfs2 -l

List all of the trace bits and their statuses.

debugfs.ocfs2 -l SUPER allow

Enable tracing for the superblock.

debugfs.ocfs2 -l SUPER off

Disable tracing for the superblock.

debugfs.ocfs2 -l SUPER deny

Disallow tracing for the superblock, even if it is implicitly enabled by another tracing mode setting.

debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow

Enable heartbeat tracing.

debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny

Disable heartbeat tracing. Note that the ENTRY and EXIT parameters are set to deny, as they exist in all trace paths.

debugfs.ocfs2 -l ENTRY EXIT NAMEI INODE allow

Enable tracing for the file system.

debugfs.ocfs2 -l ENTRY EXIT deny NAMEI INODE allow

Disable tracing for the file system.

debugfs.ocfs2 -l ENTRY EXIT DLM DLM_THREAD allow

Enable tracing for the DLM.

debugfs.ocfs2 -l ENTRY EXIT deny DLM DLM_THREAD allow

Disable tracing for the DLM.

3.6.3.2 OCFS2 Tracing Methods and Examples

One method that you can use to obtain a trace is to first enable the trace, sleep for a short while, and then disable the trace. As shown in the following example, to avoid unnecessary output, you should reset the trace bits to their default settings after you have finished tracing:

# debugfs.ocfs2 -l ENTRY EXIT NAMEI INODE allow && sleep 10 && \
  debugfs.ocfs2 -l ENTRY EXIT deny NAMEI INODE off 

To limit the amount of information that is displayed, enable only the trace bits that are relevant to diagnosing the problem.

If a specific file system command, such as mv, is causing an error, you might use commands such as those used in the following example to trace the error:

# debugfs.ocfs2 -l ENTRY EXIT NAMEI INODE allow
# mv source destination & CMD_PID=$(jobs -p %-)
# echo $CMD_PID
# debugfs.ocfs2 -l ENTRY EXIT deny NAMEI INODE off 

Because the trace is enabled for all mounted OCFS2 volumes, knowing the correct process ID can help you to interpret the trace.

For more information, see the debugfs.ocfs2(8) manual page.

3.6.4 Debugging File System Locks

If an OCFS2 volume hangs, you can use the following procedure to determine which locks are busy and which processes are likely to be holding the locks.

In the following procedure, the Lockres value refers to the lock name that is used by DLM, which is a combination of a lock-type identifier, inode number, and a generation number. The following table lists the various lock types and their associated identifier.

Table 3.1 DLM Lock Types

Identifier    Lock Type
D             File data
M             Metadata
R             Rename
S             Superblock
W             Read-write


  1. Mount the debug file system.

    # mount -t debugfs debugfs /sys/kernel/debug
  2. Dump the lock statuses for the file system device, which is /dev/sdx1 in the following example:

    # echo "fs_locks" | debugfs.ocfs2 /dev/sdx1 >/tmp/fslocks 62
    Lockres: M00000000000006672078b84822 Mode: Protected Read
    Flags: Initialized Attached
    RO Holders: 0 EX Holders: 0
    Pending Action: None Pending Unlock Action: None
    Requested Mode: Protected Read Blocking Mode: Invalid
  3. Use the Lockres value from the output in the previous step to obtain the inode number and generation number for the lock.

    # echo "stat <M00000000000006672078b84822>" | debugfs.ocfs2 -n /dev/sdx1
    Inode: 419616   Mode: 0666   Generation: 2025343010 (0x78b84822)
    ... 
  4. Determine the file system object to which the inode number relates, for example:

    # echo "locate <419616>" | debugfs.ocfs2 -n /dev/sdx1
    419616 /linux-2.6.15/arch/i386/kernel/semaphore.c
  5. Obtain the lock names that are associated with the file system object.

    # echo "encode /linux-2.6.15/arch/i386/kernel/semaphore.c" | \
      debugfs.ocfs2 -n /dev/sdx1
    M00000000000006672078b84822 D00000000000006672078b84822 W00000000000006672078b84822  

    In the previous example, a metadata lock, a file data lock, and a read-write lock are associated with the file system object.

  6. Determine the DLM domain of the file system.

    # echo "stats" | debugfs.ocfs2 -n /dev/sdX1 | grep UUID: | while read a b ; do echo $b ; done
    82DA8137A49A47E4B187F74E09FBBB4B  
  7. Use the values of the DLM domain and the lock name to enable debugging output for the DLM by running the following command:

    # echo R 82DA8137A49A47E4B187F74E09FBBB4B \
      M00000000000006672078b84822 > /proc/fs/ocfs2_dlm/debug  
  8. Examine the debug messages.

    # dmesg | tail
    struct dlm_ctxt: 82DA8137A49A47E4B187F74E09FBBB4B, node=3, key=965960985
      lockres: M00000000000006672078b84822, owner=1, state=0 last used: 0, 
      on purge list: no granted queue:
          type=3, conv=-1, node=3, cookie=11673330234144325711, ast=(empty=y,pend=n), 
          bast=(empty=y,pend=n) 
        converting queue:
        blocked queue:  

    The DLM supports three lock modes: no lock (type=0), protected read (type=3), and exclusive (type=5). In the previous example, the lock is mastered by node 1 (owner=1) and node 3 has been granted a protected-read lock on the file-system resource.

  9. Use the following command to search for processes that are in an uninterruptible sleep state, as indicated by the D flag in the STAT column:

    # ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

    Note that at least one of the processes that are in the uninterruptible sleep state is responsible for the hang on the other node.

If a process is waiting for I/O to complete, the problem could be anywhere in the I/O subsystem, from the block device layer through the drivers, to the disk array. If the hang concerns a user lock (flock()), the problem could lie with the application. If possible, kill the holder of the lock. If the hang is caused by a lack of memory or by fragmented memory, you can free up memory by killing non-essential processes. The most immediate solution is to reset the node that is holding the lock. The DLM recovery process can then clear all of the locks that the dead node owned, enabling the cluster to continue to operate.

3.6.5 Configuring the Behavior of Fenced Nodes

If a node with a mounted OCFS2 volume assumes that it is no longer in contact with the other cluster nodes, it removes itself from the cluster. This process is called fencing. Fencing prevents other nodes from hanging when they attempt to access resources that are held by the fenced node. By default, a fenced node restarts instead of panicking so that it can quickly rejoin the cluster. Under some circumstances, you might want a fenced node to panic instead of restarting, for example, so that you can use netconsole to view the oops stack trace or to diagnose the cause of frequent reboots.

To configure a node to panic when it next fences, run the following command on the node after the cluster starts:

# echo panic > /sys/kernel/config/cluster/cluster_name/fence_method

In the previous command, cluster_name is the name of the cluster.

To set the value after each system reboot, add this line to the /etc/rc.local file. To restore the default behavior, use the reset value instead of the panic value.
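
For example, the following is a minimal sketch for a cluster named mycluster (a hypothetical name); on Oracle Linux 8, the rc-local service runs /etc/rc.local at boot only if the file is executable:

# echo "echo panic > /sys/kernel/config/cluster/mycluster/fence_method" >> /etc/rc.local
# chmod +x /etc/rc.local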

3.7 OCFS2 Use Cases

The following are some typical use cases for OCFS2.

3.7.1 Load Balancing Use Case

You can use OCFS2 nodes to share resources between client systems. For example, the nodes could export a shared file system by using Samba or NFS. To distribute service requests between the nodes, you can use round-robin DNS or a network load balancer, or you can specify which node each client should use.

3.7.2 Oracle Real Application Cluster Use Case

Oracle Real Application Cluster (RAC) uses its own cluster stack, Cluster Synchronization Services (CSS). You can use O2CB in conjunction with CSS, but note that each stack is configured independently for timeouts, nodes, and other cluster settings. You can use OCFS2 to host the voting disk files and the Oracle cluster registry (OCR), but not the grid infrastructure user's home, which must exist on a local file system on each node.

Because both CSS and O2CB use the lowest node number as a tie breaker in quorum calculations, ensure that the node numbers are the same in both clusters. If necessary, edit the O2CB configuration file, /etc/ocfs2/cluster.conf, to make the node numbering consistent. Then, update this file on all of the nodes. The change takes effect when the cluster is restarted.

3.7.3 Oracle Database Use Case

Specify the noatime option when mounting volumes that host Oracle datafiles, control files, redo logs, voting disk, and OCR. The noatime option disables unnecessary updates to the access time on the inodes.

Specify the nointr mount option to prevent signals interrupting I/O transactions that are in progress.

By default, the init.ora parameter filesystemio_options directs the database to perform direct I/O to the Oracle datafiles, control files, and redo logs. You should also specify the datavolume mount option for volumes that contain the voting disk and OCR. Do not specify this option for volumes that host the Oracle user's home directory or Oracle E-Business Suite.
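
For example, the following hypothetical /etc/fstab entry combines these options for a volume that holds the voting disk and OCR (the device and mount point are placeholders; adjust them for your environment):

/dev/sdd1  /u01/crsdata  ocfs2  _netdev,datavolume,nointr,noatime  0 0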

To prevent database blocks from becoming fragmented across a disk, ensure that the file system cluster size is at least as large as the database block size, which is typically 8 KB. If you specify the file system usage type as datafiles when using the mkfs.ocfs2 command, the file system cluster size is set to 128 KB.

To enable multiple nodes to maximize throughput by concurrently streaming data to an Oracle datafile, OCFS2 deviates from the POSIX standard by not updating the modification time (mtime) on the disk when performing non-extending direct I/O writes. The value of mtime is updated in memory. However, OCFS2 does not write the value to disk unless an application extends or truncates the file or performs an operation that changes the file metadata, such as using the touch command. This behavior can result in different nodes reporting different time stamps for the same file. Use the following command to view the on-disk timestamp of a file:

# debugfs.ocfs2 -R "stat /file_path" device | grep "mtime:"