A full power or site outage can leave a Pacemaker cluster unable to safely manage resources until membership, quorum, and fencing are re-established, making controlled recovery essential for predictable service restoration.
Most Pacemaker deployments rely on Corosync for cluster membership and messaging; once quorum is reached, Pacemaker elects a Designated Controller (DC) and schedules resources according to the configured constraints. The pcs CLI coordinates starting the cluster stack across nodes and provides a consolidated view of quorum, node state, and resource placement.
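Once the stack is up on at least one node, that consolidated view can be pulled through pcs as sketched below; the commands assume a pcs-managed cluster and sudo access.

$ sudo pcs cluster status   # compact view of cluster membership and daemon state
$ sudo pcs status --full    # DC, quorum, node states, and resource placement in one report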
Recovery is most reliable when shared dependencies are healthy before the cluster stack starts: cluster networks, shared storage backends, and fencing device connectivity. Starting the cluster while those dependencies are still unhealthy can stack up resource failures or trigger STONITH fencing loops, so keep console or out-of-band access available throughout recovery to reduce the risk of being locked out.
Steps to recover a Pacemaker cluster after a full outage:
- Confirm each cluster node is reachable on the cluster network.
$ ping -c 1 node-01
PING node-01 (192.0.2.11) 56(84) bytes of data.
64 bytes from node-01 (192.0.2.11): icmp_seq=1 ttl=64 time=0.021 ms

--- node-01 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.021/0.021/0.021/0.000 ms

$ ping -c 1 node-02
PING node-02 (192.0.2.12) 56(84) bytes of data.
64 bytes from node-02.sg-pacemaker-20251231 (192.0.2.12): icmp_seq=1 ttl=64 time=0.037 ms

--- node-02 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.037/0.037/0.037/0.000 ms

$ ping -c 1 node-03
PING node-03 (192.0.2.13) 56(84) bytes of data.
64 bytes from node-03.sg-pacemaker-20251231 (192.0.2.13): icmp_seq=1 ttl=64 time=0.035 ms

--- node-03 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.035/0.035/0.035/0.000 ms
Confirm shared storage and fencing device connectivity before starting cluster services.
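A hedged pre-flight sketch for that check follows; the multipath and fence_ipmilan lines are only examples and assume multipath-backed shared storage and an IPMI fence device at 192.0.2.101 with an admin account and a local password script, so substitute the checks that match your own storage and fencing hardware.

$ lsblk                  # confirm the shared block devices are visible again
$ sudo multipath -ll     # confirm multipath paths are restored (if multipath is in use)
$ sudo fence_ipmilan --ip=192.0.2.101 --username=admin --password-script=/root/ipmi-pass.sh --action=status   # example fence device check; parameters are placeholders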
- Stop Pacemaker and Corosync on all nodes to simulate a full outage; skip this step when recovering from a real outage, since the stack is already down.
$ sudo systemctl stop pacemaker corosync
Stopping the cluster stack interrupts all clustered services.
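If pcsd is reachable, the whole stack can also be stopped from one node; a minimal sketch, assuming the nodes are already authenticated to pcs.

$ sudo pcs cluster stop --all   # stops pacemaker and corosync on every cluster node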
- Check the current cluster state from a node with pcs access.
$ sudo pcs status
Error: error running crm_mon, is pacemaker running?
  crm_mon: Connection to cluster failed: Connection refused
A cluster that is already running shows the DC, quorum state, node list, and resources in the same output.
- Start the cluster services on all nodes.
$ sudo systemctl start corosync pacemaker pcsd
Starting the cluster before storage, VLANs, or fencing devices are ready can cause repeated resource failures and recovery delays.
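Running systemctl on each node works, but the equivalent can be done from a single node through pcs; a minimal sketch, assuming pcsd is running and the nodes are authenticated.

$ sudo pcs cluster start --all   # starts corosync and pacemaker on all nodes at once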
- Confirm the cluster stack is running even if nodes are still recovering.
$ sudo pcs status
Cluster name: clustername
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: NONE
  * Last updated: Thu Jan 1 20:26:49 2026 on node-01
  * Last change: Thu Jan 1 20:23:03 2026 by root via cibadmin on node-01
  * 4 nodes configured
  * 5 resource instances configured

Node List:
  * Node node-01: UNCLEAN (offline)
  * Node node-02: UNCLEAN (offline)
  * Node node-03: UNCLEAN (offline)
  * Node node-04: UNCLEAN (offline)

Full List of Resources:
  * fence-dummy-node-01 (stonith:fence_dummy): Stopped
  * fence-dummy-node-02 (stonith:fence_dummy): Stopped
  * fence-dummy-node-03 (stonith:fence_dummy): Stopped
  * vip (ocf:heartbeat:IPaddr2): Stopped
  * app (systemd:app): Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
UNCLEAN nodes commonly appear immediately after a full outage and clear after cleanup and fencing recovery.
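If nodes stay UNCLEAN longer than expected, the Corosync and Pacemaker logs usually show whether the cluster is waiting on membership or on a pending fencing action; a sketch using journalctl on a systemd host, with the timestamp as a placeholder.

$ sudo journalctl -u corosync -u pacemaker --since "2026-01-01 20:20" --no-pager | tail -n 50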
- Confirm the current cluster properties.
$ sudo pcs property config
Cluster Properties: cib-bootstrap-options
  cluster-infrastructure=corosync
  cluster-name=clustername
  dc-version=2.1.6-6fdc9deea29
  have-watchdog=false
  last-lrm-refresh=1767240361
  no-quorum-policy=stop
  stonith-enabled=true
- Verify fencing resources before restarting services.
$ sudo pcs stonith status
  * fence-dummy-node-01 (stonith:fence_dummy): Stopped
  * fence-dummy-node-02 (stonith:fence_dummy): Stopped
  * fence-dummy-node-03 (stonith:fence_dummy): Stopped
Unhealthy fencing devices can trigger unexpected node reboots during recovery.
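Reviewing how each fence device is configured, and whether the fencer has registered it, helps catch problems before fencing is relied on; a sketch using one of the device names above.

$ sudo pcs stonith config fence-dummy-node-01   # device type, parameters, and host mapping
$ sudo stonith_admin --list-registered          # fence devices currently registered with the fencer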
- Review resource states and placement after the cluster forms.
$ sudo pcs status resources
  * vip (ocf:heartbeat:IPaddr2): Stopped
  * app (systemd:app): Stopped
Stopped resources often indicate a dependency issue or a recorded failure that blocks restart.
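Failure counts and constraints are the usual places to look; a sketch using the vip resource from this example.

$ sudo pcs resource failcount show vip   # per-node failure counts recorded for the resource
$ sudo pcs constraint config             # location, colocation, and ordering constraints in effect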
- Clear stale failure history after underlying issues are fixed.
$ sudo pcs resource cleanup
Cleaned up all resources on all nodes
Cleanup resets failure counters and can immediately retry starts, so run only after the root cause is resolved.
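Cleanup can also be limited to a single resource so that only that resource is retried; a sketch for vip only.

$ sudo pcs resource cleanup vip   # clears failure history for vip without touching other resources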
- Temporarily disable fencing while resources are restarted.
$ sudo pcs property set stonith-enabled=false
Disable stonith-enabled only long enough to recover services, then re-enable it immediately.
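Checking the property value before and after the change confirms the cluster accepted it; a small sketch.

$ sudo pcs property config | grep stonith-enabled   # shows the current stonith-enabled value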
- Re-enable critical resources after cleanup.
$ sudo pcs resource enable vip
$ sudo pcs resource enable app
- Confirm resources return to Started state.
$ sudo pcs status resources
  * vip (ocf:heartbeat:IPaddr2): Started node-01
  * app (systemd:app): Started node-02
- Re-enable fencing and verify the devices start.
$ sudo pcs property set stonith-enabled=true
$ sudo pcs stonith status
  * fence-dummy-node-01 (stonith:fence_dummy): Started node-01
  * fence-dummy-node-02 (stonith:fence_dummy): Started node-02
  * fence-dummy-node-03 (stonith:fence_dummy): Started node-03
- Verify the cluster reports quorum after recovery.
$ sudo pcs status
Cluster name: clustername
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node-02 (version 2.1.6-6fdc9deea29) - partition with quorum
  * Last updated: Thu Jan 1 20:27:09 2026 on node-01
  * Last change: Thu Jan 1 20:27:08 2026 by root via cibadmin on node-01
  * 4 nodes configured
  * 5 resource instances configured

Node List:
  * Node node-04: UNCLEAN (offline)
  * Online: [ node-01 node-02 node-03 ]

Full List of Resources:
  * fence-dummy-node-01 (stonith:fence_dummy): Started node-01
  * fence-dummy-node-02 (stonith:fence_dummy): Started node-02
  * fence-dummy-node-03 (stonith:fence_dummy): Started node-03
  * vip (ocf:heartbeat:IPaddr2): Started node-01
  * app (systemd:app): Started node-02
##### snipped #####
The "partition with quorum" status in the DC line indicates that a majority of cluster votes is present.
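The vote counts behind that status can be confirmed at the Corosync layer as well; a sketch.

$ sudo corosync-quorumtool -s   # expected votes, total votes, and the Quorate flag
$ sudo pcs quorum status        # the same vote information reported through pcs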
- Confirm expected resources are started.
$ sudo pcs status resources
  * vip (ocf:heartbeat:IPaddr2): Started node-01
  * app (systemd:app): Started node-02
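A one-shot snapshot from crm_mon gives the same confirmation without the full pcs status report; a sketch.

$ sudo crm_mon -1   # single snapshot of node and resource state, then exit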
