Failover testing proves that a Pacemaker cluster can keep services available when a node is drained for maintenance or becomes unreachable. Controlled failovers expose missing dependencies, fragile constraints, and slow resource start times before a real outage forces an unplanned cutover.

The pcs CLI manages the cluster configuration and drives state changes such as putting a node into standby. When a node enters standby, Pacemaker recalculates placement, stops the resources running on that node, and starts them on other eligible nodes, while Corosync continues to maintain membership and quorum.
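
Membership and quorum can also be checked on their own before and after a state change. A minimal sketch, assuming the quorum and node-status subcommands available in current pcs releases:
  $ sudo pcs quorum status
  $ sudo pcs status nodes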

Node-level failovers can interrupt active sessions and can trigger fencing if quorum is lost or the cluster decides a node is unsafe. Run the test during a maintenance window, confirm remaining nodes can carry the workload, and prefer a single-resource move when only one service needs to be exercised.
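
When only one service needs to be exercised, a targeted move is less disruptive than draining the whole node. A sketch using the example group and node names from the steps below; on many pcs releases, move places a temporary location constraint and clear removes it afterwards (newer releases may drop the constraint automatically):
  $ sudo pcs resource move web-stack node-02
  $ sudo pcs resource clear web-stack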

Steps to run a Pacemaker failover test with PCS:

  1. Confirm the cluster has quorum and no failed resource actions.
    $ sudo pcs status
    Cluster name: clustername
    Cluster Summary:
      * Stack: corosync (Pacemaker is running)
      * Current DC: node-01 (version 2.1.6-6fdc9deea29) - partition with quorum
      * 3 nodes configured
    ##### snipped #####

    Look for "partition with quorum" in the summary and confirm there are no failed resource actions before proceeding.
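
    To rule out older failures explicitly, the recorded failure counts can be listed as well; a quick sketch using the failcount subcommand:
    $ sudo pcs resource failcount show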

  2. List resources to record the current placement of the target service.
    $ sudo pcs status resources
      * Resource Group: web-stack:
        * cluster_ip (ocf:heartbeat:IPaddr2): Started node-01
        * web-service (systemd:nginx): Started node-01
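
    Placement is also shaped by constraints and stickiness, so it helps to record those alongside the current placement. A sketch: pcs constraint lists location, ordering, and colocation constraints, and pcs resource config (pcs resource show on older releases) prints the group's definition and meta attributes.
    $ sudo pcs constraint
    $ sudo pcs resource config web-stack
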
  3. Put the hosting node into standby to drain its resources.
    $ sudo pcs node standby node-01

    Standby stops the node's resources and restarts them on other nodes, which can drop active sessions; a loss of quorum can stop resources or trigger fencing, depending on cluster policy.
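
    The relocation can be watched in real time with Pacemaker's cluster monitor, which refreshes as resources stop and start; crm_mon ships with the pacemaker packages on most distributions.
    $ sudo crm_mon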

  4. Verify resources relocated off the standby node.
    $ sudo pcs status
    Cluster name: clustername
    Cluster Summary:
      * Stack: corosync (Pacemaker is running)
      * Current DC: node-01 (version 2.1.6-6fdc9deea29) - partition with quorum
    ##### snipped #####
    Node List:
      * Node node-01: standby
      * Online: [ node-02 node-03 ]
    
    Full List of Resources:
      * Resource Group: web-stack:
        * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
        * web-service (systemd:nginx): Started node-02
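
    A service-level check against the floating IP confirms the stack is actually answering requests after the move; the address below is a placeholder for the cluster_ip value in your environment.
    $ curl -I http://192.0.2.10/
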
  5. Return the node to active service.
    $ sudo pcs node unstandby node-01
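
    For a compact view right after unstandby, the node-only status shows node-01 returning to the Online list; the full health check follows in the next step.
    $ sudo pcs status nodes
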
  6. Confirm the cluster is healthy at the end of the failover test.
    $ sudo pcs status
    ##### snipped #####
    Node List:
      * Online: [ node-01 node-02 node-03 ]
    Full List of Resources:
      * Resource Group: web-stack:
        * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
        * web-service (systemd:nginx): Started node-02

    Resources may remain on the new node after unstandby due to stickiness and placement rules. If failures appear, clear them with pcs resource cleanup <resource> before re-testing.
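
    For the example group above, the cleanup command would be as follows; cleanup clears the resource's recorded failures so the cluster re-evaluates its state.
    $ sudo pcs resource cleanup web-stack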