3 Configuring an Initial Cluster and Service

This chapter provides an example, with step-by-step instructions, of configuring an initial cluster across two nodes that are hosted on systems with the resolvable host names node1 and node2. Each system is installed and configured by using the instructions that are provided in Installing and Configuring Pacemaker and Corosync.

The cluster is configured to run a service, Dummy, that is included in the resource-agents package. You should have installed this package along with the pacemaker packages. The Dummy resource agent simply tracks whether its service is running. Pacemaker is configured with an interval parameter that determines how long it waits between checks to determine whether the Dummy process has failed.

The Dummy process is manually stopped outside of the Pacemaker tool to simulate a failure, which is used to demonstrate how the process is restarted automatically on an alternate node.

Creating the Cluster

To create the cluster:

  1. Authenticate the pcs cluster configuration tool for the hacluster user on each node in your configuration by running the following command on one of the nodes that will form part of the cluster:

    sudo pcs host auth node1 node2 -u hacluster

    Replace node1 and node2 with the resolvable hostnames of the nodes that will form part of the cluster.

    Alternatively, if the node names are not resolvable, specify the IP addresses where the nodes can be accessed, as shown in the following example:

    sudo pcs host auth node1 addr=192.0.2.1 node2 addr=192.0.2.2 -u hacluster

    Replace 192.0.2.1 and 192.0.2.2 with the IP addresses of each of the respective hosts in the cluster.

    The tool prompts you to provide a password for the hacluster user. Provide the password that you set for this user when you installed and configured the Pacemaker software on each node.
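
    If authentication succeeds, the tool reports each node as authorized. The output is similar to the following, although the exact wording might vary between pcs releases:

    Password:
    node1: Authorized
    node2: Authorized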

  2. Create the cluster by using the pcs cluster setup command. You must specify a name for the cluster and the node names and IP addresses for each node in the cluster. For example, run the following command:

    sudo pcs cluster setup pacemaker1 node1 addr=192.0.2.1 node2 addr=192.0.2.2

    Replace pacemaker1 with an appropriate name for the cluster. Replace node1 and node2 with the resolvable hostnames of the nodes in the cluster. Replace 192.0.2.1 and 192.0.2.2 with the IP addresses of each of the respective hosts in the cluster.

    Note that if you used the addr option to specify the IP addresses when you authenticated the nodes, you do not need to specify them again when running the pcs cluster setup command.

    The cluster setup process destroys any existing cluster configuration on the specified nodes and creates a configuration file for the Corosync service that is copied to each of the nodes within the cluster.

    You can, optionally, use the --start option when running the pcs cluster setup command to automatically start the cluster once it is created.
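
    For example, the following command creates the same cluster and starts it immediately, using the same placeholder values as before:

    sudo pcs cluster setup pacemaker1 node1 addr=192.0.2.1 node2 addr=192.0.2.2 --start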

  3. If you have not already started the cluster as part of the cluster setup command, start the cluster on all of the nodes. To start the cluster manually, use the pcs command:

    sudo pcs cluster start --all

    Alternatively, you can start the cluster by starting the pacemaker service from systemd. Note that this starts the cluster services on the local node only, so you must run the command on each node in the cluster, for example:

    sudo systemctl start pacemaker.service

  4. Optionally, you can enable the corosync and pacemaker services to start at boot time so that if a node reboots, it automatically rejoins the cluster, for example:

    sudo pcs cluster enable --all

    Alternatively, you can enable the pacemaker service from systemd on each node, for example:

    sudo systemctl enable pacemaker.service

    Note:

    Some users prefer not to enable these services so that a node failure resulting in a full system reboot can be properly debugged before it rejoins the cluster.
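
After the cluster has been created and started, you can optionally confirm that both nodes have joined it by checking the cluster status, for example:

    sudo pcs cluster status

The output shows whether the cluster services are running and which nodes are currently online.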

Setting Cluster Parameters

Fencing is an important part of setting up a production-level HA cluster. For simplicity, it is disabled in this example. If you intend to take advantage of stonith, see About Fencing Configuration (stonith) for additional information.

To set cluster parameters:

  1. Disable the fencing feature by running the following command:

    sudo pcs property set stonith-enabled=false

    Fencing is an advanced feature that helps protect your data from being corrupted by nodes that might be failing or are unavailable. Pacemaker uses the term stonith (shoot the other node in the head) to describe fencing options. This configuration depends on particular hardware and a deeper understanding of the fencing process. For this reason, fencing is disabled for this example.
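
    You can optionally confirm that the property has been applied by listing the cluster properties, for example:

    sudo pcs property list

    Depending on the pcs release, the equivalent subcommand might be pcs property config.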

  2. Optionally, configure the cluster to ignore the quorum state by running the following command:

    sudo pcs property set no-quorum-policy=ignore

    Because this example uses a two-node cluster, configuring the cluster to ignore the quorum state makes the most sense: quorum technically requires a minimum of three nodes to be a viable configuration, and quorum is only achieved when more than half of the nodes agree on the status of the cluster.

    In the current release of Corosync, this issue is treated specially for two-node clusters, where the quorum value is artificially set to 1 so that the primary node is always considered in quorum. In the case where a network outage results in both nodes going offline for a period, the nodes race to fence each other and the first to succeed wins quorum. The fencing agent can usually be configured to give one node priority so that it is more likely to win quorum if this is preferred.
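
    For reference, when pcs creates a two-node cluster it typically enables this behavior through the two_node option in the quorum section of /etc/corosync/corosync.conf on each node. The section looks similar to the following, although the exact contents depend on your configuration:

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }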

  3. Configure a migration policy by running the following command:

    sudo pcs resource defaults update migration-threshold=1

    Running this command configures the cluster to move the service to a new node after a single failure.
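
    You can review the currently configured resource defaults, including the migration-threshold value, by running the following command:

    sudo pcs resource defaults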

Creating a Service and Testing Failover

Services are created and usually configured to run a resource agent that is responsible for starting and stopping processes. Most resource agents are created according to the OCF (Open Cluster Framework) specification, which is defined as an extension for the Linux Standard Base (LSB). Many useful resource agents for commonly used processes are included in the resource-agents packages, including various heartbeat agents that track whether commonly used daemons or services are still running.

In the following example, a service is set up that uses a Dummy resource agent created specifically for testing Pacemaker. This agent is used because it requires only a basic configuration and doesn't make any assumptions about the environment or the types of services that you intend to run with Pacemaker.

To create a service and test failover:

  1. Add the service as a resource by using the pcs resource create command:

    sudo pcs resource create dummy_service ocf:pacemaker:Dummy op monitor interval=120s

    In the previous example, dummy_service is the name that is provided for this resource:

    To invoke the Dummy resource agent, a notation (ocf:pacemaker:Dummy) is used to specify that it conforms to the OCF standard, that it runs in the pacemaker namespace, and that the Dummy script is used. If you were configuring a heartbeat monitor service for an Oracle Database, you might use the ocf:heartbeat:oracle resource agent.

    The resource is configured to use the monitor operation in the agent and an interval is set to check the health of the service. In this example, the interval is set to 120s to give the service sufficient time to fail while you're demonstrating failover. By default, this interval is typically set to 20 seconds, but it can be modified depending on the type of service and the particular environment.

    When you create a service, the cluster starts the resource on a node by using the resource agent's start command.
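
    If you want to see which parameters and operations a resource agent supports before you add it, you can query the agent's metadata. For example, the following command describes the Dummy agent that is used in this example:

    sudo pcs resource describe ocf:pacemaker:Dummy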

  2. View the resource start and run status, for example:

    sudo pcs status

    The following output is displayed:

    Cluster name: pacemaker1
    Stack: corosync
    Current DC: node2 (version 2.1.2-4.0.1.el8_6.2-ada5c3b36e2) - partition with quorum
    Last updated: Wed Jul 13 04:56:27 2022
    Last change: Wed Jul 13 04:56:11 2022 by root via cibadmin on node1
    
    2 nodes configured
    1 resource configured
    
    Online: [ node1 node2 ]
    
    Full list of resources:
    
     dummy_service  (ocf::pacemaker:Dummy): Started node1
    
    Daemon Status:
      corosync: active/disabled
      pacemaker: active/disabled
      pcsd: active/enabled

  3. Run the crm_resource command to simulate service failure by force-stopping the service directly:

    sudo crm_resource --resource dummy_service --force-stop

    Running the crm_resource command ensures that the cluster is unaware that the service has been manually stopped.
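
    You can optionally confirm that the underlying service has stopped before the cluster notices by invoking the resource agent's monitor operation directly, outside of cluster control, for example:

    sudo crm_resource --resource dummy_service --force-check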

  4. Run the crm_mon command in interactive mode so that you can wait for the cluster to register the failure and display the Failed Actions message, for example:

    sudo crm_mon

    The following output is displayed:

    Stack: corosync
    Current DC: node1 (version 2.1.2-4.0.1.el8_6.2-ada5c3b36e2) - partition with quorum
    Last updated: Wed Jul 13 05:00:27 2022
    Last change: Wed Jul 13 04:56:11 2022 by root via cibadmin on node1
    
    2 nodes configured
    1 resource configured
    
    Online: [ node1 node2 ]
    
    Active resources:
    
    dummy_service   (ocf::pacemaker:Dummy): Started node2
    
    Failed Resource Actions:
    * dummy_service_monitor_120000 on node1 'not running' (7): call=7, status=complete, exitreason='',
        last-rc-change='Wed Jul 13 05:00:11 2022', queued=0ms, exec=0ms

    You can see the service restart on the alternate node. Note that the monitor interval is set to 120 seconds, so you might need to wait up to the full interval before you see notification that a node has gone offline.

    Tip:

    You can use the Ctrl-C key combination to exit out of crm_mon at any point.
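
    After you have observed the failover, you can optionally clear the recorded failure so that it no longer appears in the cluster status, for example:

    sudo pcs resource cleanup dummy_service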

  5. Reboot the node where the service is running to determine whether failover also occurs in the case of node failure.

    Note that if you didn't enable the corosync and pacemaker services to start on boot, you might need to manually start the services on the node that you rebooted by running the following command:

    sudo pcs cluster start node1
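
    After the cluster services are running again on the rebooted node, confirm that it has rejoined the cluster and check where the dummy_service resource is now running, for example:

    sudo pcs status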