16.3 Configuring Your First Cluster and Service

In this example, a cluster is configured across two nodes hosted on systems with the resolvable hostnames of node1 and node2. Each system is installed and configured using the instructions provided in Section 16.2, “Installing Pacemaker and Corosync”.

The cluster is configured to run a service that uses the Dummy resource agent, which is included in the resource-agents package that you should have installed along with the pacemaker packages. This agent does nothing more than keep track of whether or not it is running. We configure Pacemaker with an interval parameter that determines how long it should wait between checks to determine whether the Dummy process has failed.

We manually stop the Dummy process outside of Pacemaker's control to simulate a failure, and use this to demonstrate how the service is restarted automatically on an alternate node.

Creating the cluster

  1. Authenticate the pcs cluster configuration tool for the hacluster user on each node in your configuration. To do this, run the following command on one of the nodes that will form part of the cluster:

    # pcs cluster auth node1 node2 -u hacluster

    Replace node1 and node2 with the resolvable hostnames of the nodes that will form part of the cluster. The tool will prompt you to provide a password for the hacluster user. You should provide the password that you set for this user when you installed and configured the Pacemaker software on each node.
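
    The pcs cluster auth command communicates with the pcsd daemon on each node. If the authentication step fails, a useful first check is to confirm that pcsd is running on both nodes, for example:

    # systemctl status pcsd.service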

  2. To create the cluster, use the pcs cluster setup command. You must specify a name for the cluster and the resolvable hostnames for each node in the cluster:

    # pcs cluster setup --name pacemaker1 node1 node2

    Replace pacemaker1 with an appropriate name for the cluster. Replace node1 and node2 with the resolvable hostnames of the nodes in the cluster.
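
    The setup command generates the Corosync configuration and synchronizes it to each node. If you want to review what was created, the configuration is typically written to /etc/corosync/corosync.conf and can be viewed on either node:

    # cat /etc/corosync/corosync.conf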

  3. Start the cluster on all nodes. You can do this manually using the pcs command:

    # pcs cluster start --all

    You can also do this by starting the pacemaker and corosync services through systemd. Note that, unlike the pcs cluster start --all command, these systemctl commands only act on the node where you run them, so you must run them on every node in the cluster:

    # systemctl start pacemaker.service
    # systemctl start corosync.service

    Optionally, you can enable these services to start at boot time, so that if a node reboots it automatically rejoins the cluster:

    # systemctl enable pacemaker.service
    # systemctl enable corosync.service
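
    Alternatively, the pcs tool can enable both services on every node in a single step:

    # pcs cluster enable --all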

    Some users prefer not to enable these services, so that a node failure resulting in a full system reboot can be properly debugged before the node rejoins the cluster.
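
    Whichever method you use to start the cluster, you can confirm that both nodes are online by checking the cluster status:

    # pcs cluster status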

Setting cluster parameters

  1. Fencing is an advanced feature that helps protect your data from being corrupted by nodes that may be failing or unavailable. Pacemaker uses the term stonith (shoot the other node in the head) to describe fencing options. Since this configuration depends on particular hardware and a deeper understanding of the fencing process, we recommend disabling the fencing feature for this example.

    # pcs property set stonith-enabled=false

    Fencing is an important part of setting up a production-level HA cluster and is disabled in this example to keep things simple. If you intend to take advantage of stonith, see Section 16.4, “Fencing Configuration” for more information.
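
    If you later decide to configure fencing, you can list the fence agents that are available on your system, assuming that the appropriate fence-agents packages are installed:

    # pcs stonith list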

  2. Since this example is a two-node cluster, you can set the no-quorum policy to ignore, as quorum requires a minimum of three nodes to make any sense. Quorum is only achieved when more than half of the nodes agree on the status of the cluster. In a two-node cluster, the surviving node cannot achieve quorum on its own when the other node fails, so failover would be blocked at exactly the moment it is needed. Configure the cluster to ignore the quorum state:

    # pcs property set no-quorum-policy=ignore

  3. Configure a migration policy. In this example, we configure the cluster to move the service to a new node after a single failure:

    # pcs resource defaults migration-threshold=1
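
    You can review the changes made in this section at any time. The pcs property list command displays cluster properties that differ from their defaults, and running pcs resource defaults with no arguments displays the resource defaults that are currently set:

    # pcs property list
    # pcs resource defaults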

Creating a service and testing failover

A service is usually configured to run a resource agent, a script that is responsible for starting, stopping, and monitoring a process. Most resource agents are written according to the OCF (Open Cluster Framework) specification, which is defined as an extension of the Linux Standard Base (LSB). The resource-agents package includes many handy agents for commonly used processes, including a variety of agents in the heartbeat provider namespace that track whether commonly used daemons or services are still running.

In this example, we set up a service that uses the Dummy resource agent, which is created precisely for the purpose of testing Pacemaker. We use this agent because it requires the least possible configuration and does not make any assumptions about your environment or the types of services that you intend to run with Pacemaker.
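
If you want to see which other resource agents are available, you can list the agents that a particular provider supplies and display the parameters that an agent accepts. For example, to list the agents in the pacemaker namespace and to describe the Dummy agent itself:

    # pcs resource list ocf:pacemaker
    # pcs resource describe ocf:pacemaker:Dummy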

  1. To add the service as a resource, use the pcs resource create command. Provide a name for the service. In the example below, we use the name dummy_service for this resource:

    # pcs resource create dummy_service ocf:pacemaker:Dummy op monitor interval=120s

    To invoke the Dummy resource agent, a notation (ocf:pacemaker:Dummy) is used to specify that it conforms to the OCF standard, that it is in the pacemaker provider namespace, and that the Dummy script should be used. If you were configuring a heartbeat monitor service for an Oracle Database, you might use the ocf:heartbeat:oracle resource agent.

    The resource is configured to use the monitor operation in the agent, and an interval is set to determine how often the health of the service is checked. In this example, we set the interval to 120s to give you time to fail the service manually while you are demonstrating failover. By default, the interval is usually set to 20 seconds, but it can be modified depending on the type of service and your own environment.
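
    Once the resource has been created, you can display its configuration, including the monitor interval, at any time. Depending on your version of pcs, the subcommand to use is either show or config:

    # pcs resource show dummy_service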

  2. As soon as you create a service, the cluster attempts to start the resource on a node by using the resource agent's start command. You can check that the resource has started and view its status by running the pcs status command:

    # pcs status
    Cluster name: pacemaker1
    Stack: corosync
    Current DC: node1 (version 1.1.16-12.el7-94ff4df) - partition with quorum
    Last updated: Wed Jan 17 06:35:18 2018
    Last change: Wed Jan 17 03:08:00 2018 by root via cibadmin on node1
    
    2 nodes configured
    1 resource configured
    
    Online: [ node2 node1 ]
    
    Full list of resources:
    
     dummy_service   (ocf::pacemaker:Dummy): Started node2
    
    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled

  3. Simulate a service failure by force-stopping the service directly with crm_resource, so that the cluster is unaware that the service has been stopped manually:

    # crm_resource --resource dummy_service --force-stop

  4. Run crm_mon in interactive mode so that you can wait until you see the service fail and a Failed Actions message displayed. You should see the service restart on the alternate node:

    # crm_mon
    Stack: corosync
    Current DC: node1 (version 1.1.16-12.el7-94ff4df) - partition with quorum
    Last updated: Wed Jan 17 06:41:04 2018
    Last change: Wed Jan 17 06:39:02 2018 by root via cibadmin on node1
    
    2 nodes configured
    1 resource configured
    
    Online: [ node2 node1 ]
    
    Active resources:
    
    dummy_service    (ocf::pacemaker:Dummy): Started node1
    
    Failed Actions:
    * dummy_service_monitor_120000 on node2 'not running' (7): call=16, status=complete, exitreason='none',
        last-rc-change='Wed Jan 17 06:41:02 2018', queued=0ms, exec=0ms

    You can press Ctrl-C to exit crm_mon at any point.
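
    If you prefer not to leave crm_mon running interactively, you can print a single snapshot of the cluster status and return to the shell prompt:

    # crm_mon -1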

  5. You can try rebooting the node where the service is running to see that failover also occurs in the case of node failure. Note that if you have not enabled the corosync and pacemaker services to start at boot time, you may need to start the cluster services manually on the node that you rebooted. For example:

    # pcs cluster start node1
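
    After the rebooted node has started its cluster services, you can run pcs status again to confirm that the node shows as online and to see where the dummy_service resource is now running:

    # pcs status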