A full power or site outage leaves a Pacemaker cluster unable to manage resources safely until membership, quorum, and fencing are re-established, so a controlled recovery sequence is essential for predictable service restoration.

Most Pacemaker deployments rely on Corosync for cluster membership and messaging; once quorum is reached, Pacemaker elects a Designated Controller (DC), which schedules resources according to the configured constraints. The pcs CLI coordinates starting the cluster stack across nodes and provides a consolidated view of quorum, node state, and resource placement.
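
Once Corosync is running, its view of membership and votes can also be read directly, independent of Pacemaker:
    $ sudo corosync-quorumtool -s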

Recovery is most reliable when shared dependencies come back first: cluster networks, shared storage backends, and fencing device connectivity. Starting the cluster while those dependencies are still unhealthy can stack up resource failures or trigger STONITH fencing loops, so keep console or out-of-band access available throughout recovery to reduce lockout risk.
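
With IPMI-based fencing, for example, device reachability can be verified before any cluster service starts; the BMC address and credentials below are placeholders, not part of this cluster's configuration:
    $ ping -c 1 192.0.2.101
    $ ipmitool -I lanplus -H 192.0.2.101 -U fenceuser -P 'secret' chassis power status
    Chassis Power is on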

Steps to recover a Pacemaker cluster after a full outage:

  1. Confirm each cluster node is reachable on the cluster network.
    $ ping -c 1 node-01
    PING node-01 (192.0.2.11) 56(84) bytes of data.
    64 bytes from node-01 (192.0.2.11): icmp_seq=1 ttl=64 time=0.021 ms
    
    --- node-01 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.021/0.021/0.021/0.000 ms
    
    $ ping -c 1 node-02
    PING node-02 (192.0.2.12) 56(84) bytes of data.
    64 bytes from node-02 (192.0.2.12): icmp_seq=1 ttl=64 time=0.037 ms
    
    --- node-02 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.037/0.037/0.037/0.000 ms
    
    $ ping -c 1 node-03
    PING node-03 (192.0.2.13) 56(84) bytes of data.
    64 bytes from node-03 (192.0.2.13): icmp_seq=1 ttl=64 time=0.035 ms
    
    --- node-03 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.035/0.035/0.035/0.000 ms

    Confirm shared storage and fencing device connectivity before starting cluster services. Node node-04 is not reachable in this example and remains offline throughout the recovery.
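
    The storage check depends on the backend; with multipath SAN storage, for example, every path should be visible and healthy before the stack starts (device names vary by environment):
    $ sudo multipath -ll
    $ lsblk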

  2. Stop Pacemaker and Corosync on all nodes to simulate a full outage (when recovering from a real outage, the stack is already down and this step can be skipped).
    $ sudo systemctl stop pacemaker corosync

    Stopping the cluster stack interrupts all clustered services.
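
    Confirm the stack is down on each node before proceeding:
    $ systemctl is-active pacemaker corosync
    inactive
    inactive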

  3. Check the current cluster state from a node with pcs access.
    $ sudo pcs status
    Error: error running crm_mon, is pacemaker running?
      crm_mon: Connection to cluster failed: Connection refused

    When the cluster is running, the same command reports the DC, quorum state, node list, and resource placement.

  4. Start the cluster services on all nodes.
    $ sudo systemctl start corosync pacemaker pcsd

    Starting the cluster before storage, VLANs, or fencing devices are ready can cause repeated resource failures and recovery delays.
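
    Equivalently, when pcsd is already running on all nodes and they are authenticated to each other, the stack can be started cluster-wide from one host:
    $ sudo pcs cluster start --all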

  5. Confirm the cluster stack is running even if nodes are still recovering.
    $ sudo pcs status
    Cluster name: clustername
    Cluster Summary:
      * Stack: corosync (Pacemaker is running)
      * Current DC: NONE
      * Last updated: Thu Jan  1 20:26:49 2026 on node-01
      * Last change:  Thu Jan  1 20:23:03 2026 by root via cibadmin on node-01
      * 4 nodes configured
      * 5 resource instances configured
    
    Node List:
      * Node node-01: UNCLEAN (offline)
      * Node node-02: UNCLEAN (offline)
      * Node node-03: UNCLEAN (offline)
      * Node node-04: UNCLEAN (offline)
    
    Full List of Resources:
      * fence-dummy-node-01	(stonith:fence_dummy):	 Stopped
      * fence-dummy-node-02	(stonith:fence_dummy):	 Stopped
      * fence-dummy-node-03	(stonith:fence_dummy):	 Stopped
      * vip	(ocf:heartbeat:IPaddr2):	 Stopped
      * app	(systemd:app):	 Stopped
    
    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled

    UNCLEAN (offline) means Pacemaker cannot yet confirm a node's state; the flag commonly appears immediately after a full outage and clears as nodes rejoin or after cleanup and fencing recovery.
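
    To follow the recovery live instead of re-running pcs status, crm_mon gives a continuously refreshing view (crm_mon -1 prints one snapshot and exits):
    $ sudo crm_mon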

  6. Confirm the current cluster properties.
    $ sudo pcs property config
    Cluster Properties: cib-bootstrap-options
      cluster-infrastructure=corosync
      cluster-name=clustername
      dc-version=2.1.6-6fdc9deea29
      have-watchdog=false
      last-lrm-refresh=1767240361
      no-quorum-policy=stop
      stonith-enabled=true
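
    With no-quorum-policy=stop, resources stay stopped until a majority of votes returns; once Corosync is running, check the live vote count from any node:
    $ sudo pcs quorum status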

  7. Verify fencing resources before restarting services.
    $ sudo pcs stonith status
      * fence-dummy-node-01	(stonith:fence_dummy):	 Stopped
      * fence-dummy-node-02	(stonith:fence_dummy):	 Stopped
      * fence-dummy-node-03	(stonith:fence_dummy):	 Stopped

    Unhealthy fencing devices can trigger unexpected node reboots during recovery.
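
    The fencing device configuration, including which nodes each device covers, can be reviewed without triggering any fencing action:
    $ sudo pcs stonith config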

  8. Review resource states and placement after the cluster forms.
    $ sudo pcs status resources
      * vip	(ocf:heartbeat:IPaddr2):	 Stopped
      * app	(systemd:app):	 Stopped

    Stopped resources often indicate a dependency issue or a recorded failure that blocks restart.
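
    To check whether a recorded failure or a constraint is holding a resource down, with vip serving as the example here:
    $ sudo pcs resource failcount show vip
    $ sudo pcs constraint config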

  9. Clear stale failure history after underlying issues are fixed.
    $ sudo pcs resource cleanup
    Cleaned up all resources on all nodes

    Cleanup resets failure counters and can immediately retry starts, so run only after the root cause is resolved.
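
    Cleanup can also be scoped to a single resource rather than run cluster-wide:
    $ sudo pcs resource cleanup vip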

  10. Temporarily disable fencing while resources are restarted.
    $ sudo pcs property set stonith-enabled=false

    Disable stonith-enabled only long enough to recover services, then re-enable it immediately; while fencing is off, the cluster cannot safely isolate a failed node.
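
    Confirm the change took effect before touching resources:
    $ sudo pcs property config stonith-enabled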

  11. Re-enable critical resources after cleanup.
    $ sudo pcs resource enable vip
    $ sudo pcs resource enable app

    Enabling allows the cluster to start a resource again by clearing its target-role=Stopped setting.

  12. Confirm resources return to Started state.
    $ sudo pcs status resources
      * vip	(ocf:heartbeat:IPaddr2):	 Started node-01
      * app	(systemd:app):	 Started node-02

  13. Re-enable fencing and verify the devices start.
    $ sudo pcs property set stonith-enabled=true
    $ sudo pcs stonith status
      * fence-dummy-node-01	(stonith:fence_dummy):	 Started node-01
      * fence-dummy-node-02	(stonith:fence_dummy):	 Started node-02
      * fence-dummy-node-03	(stonith:fence_dummy):	 Started node-03

  14. Verify the cluster reports quorum after recovery.
    $ sudo pcs status
    Cluster name: clustername
    Cluster Summary:
      * Stack: corosync (Pacemaker is running)
      * Current DC: node-02 (version 2.1.6-6fdc9deea29) - partition with quorum
      * Last updated: Thu Jan  1 20:27:09 2026 on node-01
      * Last change:  Thu Jan  1 20:27:08 2026 by root via cibadmin on node-01
      * 4 nodes configured
      * 5 resource instances configured
    
    Node List:
      * Node node-04: UNCLEAN (offline)
      * Online: [ node-01 node-02 node-03 ]
    
    Full List of Resources:
      * fence-dummy-node-01	(stonith:fence_dummy):	 Started node-01
      * fence-dummy-node-02	(stonith:fence_dummy):	 Started node-02
      * fence-dummy-node-03	(stonith:fence_dummy):	 Started node-03
      * vip	(ocf:heartbeat:IPaddr2):	 Started node-01
      * app	(systemd:app):	 Started node-02
    
    ##### snipped #####

    "partition with quorum" in the Current DC line indicates that a majority of cluster votes is present. Node node-04 remains UNCLEAN (offline) until it rejoins the cluster or is fenced.
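
    If node-04 is confirmed to be down, fencing it manually clears the UNCLEAN state; note this assumes a fencing device covers node-04 (the example configuration only shows devices for node-01 through node-03) and that the command power-cycles the target:
    $ sudo pcs stonith fence node-04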

  15. Confirm expected resources are started.
    $ sudo pcs status resources
      * vip	(ocf:heartbeat:IPaddr2):	 Started node-01
      * app	(systemd:app):	 Started node-02