Failover testing proves that a Pacemaker cluster can keep services available when a node is drained for maintenance or becomes unreachable. Controlled failovers expose missing dependencies, fragile constraints, and slow resource start times before a real outage forces an unplanned cutover.

The pcs CLI edits the cluster configuration and drives state changes such as putting a node into standby. When a node enters standby, Pacemaker recalculates placement, stops the resources on that node, and starts them on other eligible nodes, while Corosync continues to maintain membership and quorum.
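
As a rough illustration of how standby state shows up, the sketch below reads pcs status text on standard input and prints the nodes reported as standby. The helper name standby_nodes is hypothetical, and the parsing assumes the "* Node <name>: standby" layout used in the example output in the steps below; other Pacemaker versions may format the node list differently.

```shell
# Hypothetical helper: print the names of nodes that the status text on
# stdin reports as standby. Assumes lines such as "  * Node node-01: standby".
standby_nodes() {
    awk '$2 == "Node" && $4 ~ /^standby/ {
        sub(/:$/, "", $3)   # drop the trailing colon after the node name
        print $3
    }'
}
```

One possible use is `sudo pcs status | standby_nodes`, which should print nothing before the test starts.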

Node-level failovers can interrupt active sessions and can trigger fencing if quorum is lost or the cluster decides a node is unsafe. Run the test during a maintenance window, confirm remaining nodes can carry the workload, and prefer a single-resource move when only one service needs to be exercised.
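
A single-resource move could be sketched as follows. The pcs resource move and pcs resource clear subcommands are part of pcs; the helper names and the PCS override variable are assumptions made for this sketch, and web-stack and node-02 in the usage note are taken from the example output in the steps below. Depending on the pcs version, a move may leave a location constraint behind, so the clear step is kept separate and is meant to run only after the resource has started on the target node.

```shell
# Assumption of this sketch: PCS can be overridden (for example PCS="echo pcs")
# to dry-run the commands without touching a cluster.
PCS=${PCS:-"sudo pcs"}

# Ask the cluster to move one resource or group to a target node.
move_resource() {
    $PCS resource move "$1" "$2"
}

# Remove any location constraint left behind by an earlier move so that
# normal placement rules apply again.
clear_move_constraint() {
    $PCS resource clear "$1"
}
```

For example, `move_resource web-stack node-02` requests the move, and `clear_move_constraint web-stack` can follow once the status output shows the group started on node-02.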

Steps to run a Pacemaker failover test with pcs:

  1. Confirm the cluster has quorum with no failed actions.
    $ sudo pcs status
    Cluster name: clustername
    Cluster Summary:
      * Stack: corosync (Pacemaker is running)
      * Current DC: node-01 (version 2.1.6-6fdc9deea29) - partition with quorum
      * 3 nodes configured
    ##### snipped #####

    Look for partition with quorum and no failed actions before proceeding.

  2. List resources to record the current placement of the target service.
    $ sudo pcs status resources
      * Resource Group: web-stack:
        * cluster_ip (ocf:heartbeat:IPaddr2): Started node-01
        * web-service (systemd:nginx): Started node-01

  3. Put the hosting node into standby to drain its resources.
    $ sudo pcs node standby node-01

    Putting a node into standby stops its resources and restarts them on other nodes, which drops active sessions. If the cluster loses quorum during the test, it may stop resources or fence nodes, depending on the no-quorum-policy setting.

  4. Verify that the resources relocated off the standby node. If the test is measured against a recovery time objective (RTO), record how long the relocation took.
    $ sudo pcs status
    Cluster name: clustername
    Cluster Summary:
      * Stack: corosync (Pacemaker is running)
      * Current DC: node-01 (version 2.1.6-6fdc9deea29) - partition with quorum
    ##### snipped #####
    Node List:
      * Node node-01: standby
      * Online: [ node-02 node-03 ]
    
    Full List of Resources:
      * Resource Group: web-stack:
        * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
        * web-service (systemd:nginx): Started node-02

  5. Return the node to active service.
    $ sudo pcs node unstandby node-01

  6. Confirm the cluster is healthy at the end of the failover test.
    $ sudo pcs status
    ##### snipped #####
    Node List:
      * Online: [ node-01 node-02 node-03 ]
    Full List of Resources:
      * Resource Group: web-stack:
        * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
        * web-service (systemd:nginx): Started node-02

    Resources may remain on the new node after unstandby due to stickiness and placement rules. If failures appear, clear them with pcs resource cleanup <resource> before re-testing.
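
The pre-check in step 1 and the timing in step 4 can be scripted along the lines of the sketch below. All helper names and the STATUS_CMD variable are hypothetical. The parsing assumes the pcs status layout shown in the steps above, and it assumes that failures are listed under a heading that matches "Failed ... Actions"; both may differ between Pacemaker versions.

```shell
# Assumption of this sketch: STATUS_CMD can be overridden to dry-run the
# polling loop without a cluster.
STATUS_CMD=${STATUS_CMD:-"sudo pcs status resources"}

# Succeed when the status text on stdin reports quorum and no failed actions.
cluster_ready() {
    status=$(cat)
    printf '%s\n' "$status" | grep -q 'partition with quorum' || return 1
    # Assumed heading text for failure sections.
    if printf '%s\n' "$status" | grep -Eq 'Failed (Resource |Fencing )?Actions'; then
        return 1
    fi
    return 0
}

# Print the node on which a resource is reported as Started.
# Expects lines such as "* cluster_ip (ocf:heartbeat:IPaddr2): Started node-02".
resource_node() {
    awk -v r="$1" '$1 == "*" && $2 == r && $(NF-1) == "Started" { print $NF }'
}

# Poll until the resource runs on a node other than old_node, then print the
# elapsed seconds. Arguments: resource, old_node, optional timeout in seconds.
time_relocation() {
    resource=$1 old_node=$2 timeout=${3:-120}
    start=$(date +%s)
    while [ $(( $(date +%s) - start )) -lt "$timeout" ]; do
        node=$($STATUS_CMD | resource_node "$resource")
        if [ -n "$node" ] && [ "$node" != "$old_node" ]; then
            echo "$resource started on $node after $(( $(date +%s) - start ))s"
            return 0
        fi
        sleep 1
    done
    echo "$resource did not relocate within ${timeout}s" >&2
    return 1
}
```

A run might start with `sudo pcs status | cluster_ready`, then issue the standby command from step 3, and then call `time_relocation cluster_ip node-01`. The one-second polling interval limits how precisely the elapsed time can be compared with an RTO.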