A full power or site outage can leave a Pacemaker cluster unable to safely manage resources until membership, quorum, and fencing are re-established, making controlled recovery essential for predictable service restoration.
Most Pacemaker deployments rely on Corosync for cluster membership and messaging; once quorum is reached, Pacemaker elects a Designated Controller (DC) and schedules resources according to the configured constraints. The pcs CLI coordinates starting the cluster stack across nodes and provides a consolidated view of quorum, node state, and resource placement.
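Once the stack is up on at least one node, that consolidated view can be pulled through pcs as sketched below; the commands assume a pcs-managed cluster and sudo access.

$ sudo pcs cluster status   # compact view of cluster membership and daemon state
$ sudo pcs status --full    # DC, quorum, node states, and resource placement in one report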
Recovery is most reliable when shared dependencies are healthy before the cluster stack starts: cluster networks, shared storage backends, and fencing device connectivity. Starting the cluster while those dependencies are still unhealthy can stack up resource failures or trigger STONITH fencing loops, so keep console or out-of-band access available throughout recovery to reduce the risk of being locked out.
Steps to recover a Pacemaker cluster after a full outage:
- Confirm each cluster node is reachable on the cluster network.
$ ping -c 1 node-01
PING node-01 (192.0.2.11) 56(84) bytes of data.
64 bytes from node-01 (192.0.2.11): icmp_seq=1 ttl=64 time=0.021 ms

--- node-01 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.021/0.021/0.021/0.000 ms

$ ping -c 1 node-02
PING node-02 (192.0.2.12) 56(84) bytes of data.
64 bytes from node-02.sg-pacemaker-20251231 (192.0.2.12): icmp_seq=1 ttl=64 time=0.037 ms

--- node-02 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.037/0.037/0.037/0.000 ms

$ ping -c 1 node-03
PING node-03 (192.0.2.13) 56(84) bytes of data.
64 bytes from node-03.sg-pacemaker-20251231 (192.0.2.13): icmp_seq=1 ttl=64 time=0.035 ms

--- node-03 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.035/0.035/0.035/0.000 ms
Confirm shared storage and fencing device connectivity before starting cluster services.
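A hedged pre-flight sketch for that check follows; the multipath and fence_ipmilan lines are only examples and assume multipath-backed shared storage and an IPMI fence device at 192.0.2.101 with an admin account and a local password script, so substitute the checks that match your own storage and fencing hardware.

$ lsblk                  # confirm the shared block devices are visible again
$ sudo multipath -ll     # confirm multipath paths are restored (if multipath is in use)
$ sudo fence_ipmilan --ip=192.0.2.101 --username=admin --password-script=/root/ipmi-pass.sh --action=status   # example fence device check; parameters are placeholders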
- Stop Pacemaker and Corosync on all nodes to simulate a full outage; skip this step when recovering from a real outage, since the stack is already down.
$ sudo systemctl stop pacemaker corosync
Stopping the cluster stack interrupts all clustered services.
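If pcsd is reachable, the whole stack can also be stopped from one node; a minimal sketch, assuming the nodes are already authenticated to pcs.

$ sudo pcs cluster stop --all   # stops pacemaker and corosync on every cluster node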
- Check the current cluster state from a node with pcs access.
$ sudo pcs status
Error: error running crm_mon, is pacemaker running?
  crm_mon: Connection to cluster failed: Connection refused
A cluster that is already running shows the DC, quorum state, node list, and resources in the same output.
- Start the cluster services on all nodes.
$ sudo systemctl start corosync pacemaker pcsd
Starting the cluster before storage, VLANs, or fencing devices are ready can cause repeated resource failures and recovery delays.
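Running systemctl on each node works, but the equivalent can be done from a single node through pcs; a minimal sketch, assuming pcsd is running and the nodes are authenticated.

$ sudo pcs cluster start --all   # starts corosync and pacemaker on all nodes at once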
- Confirm the cluster stack is running even if nodes are still recovering.
$ sudo pcs status
Cluster name: clustername
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: NONE
  * Last updated: Thu Jan 1 20:26:49 2026 on node-01
  * Last change: Thu Jan 1 20:23:03 2026 by root via cibadmin on node-01
  * 4 nodes configured
  * 5 resource instances configured

Node List:
  * Node node-01: UNCLEAN (offline)
  * Node node-02: UNCLEAN (offline)
  * Node node-03: UNCLEAN (offline)
  * Node node-04: UNCLEAN (offline)

Full List of Resources:
  * fence-dummy-node-01 (stonith:fence_dummy): Stopped
  * fence-dummy-node-02 (stonith:fence_dummy): Stopped
  * fence-dummy-node-03 (stonith:fence_dummy): Stopped
  * vip (ocf:heartbeat:IPaddr2): Stopped
  * app (systemd:app): Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
UNCLEAN nodes commonly appear immediately after a full outage and clear after cleanup and fencing recovery.
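If nodes stay UNCLEAN longer than expected, the Corosync and Pacemaker logs usually show whether the cluster is waiting on membership or on a pending fencing action; a sketch using journalctl on a systemd host, with the timestamp as a placeholder.

$ sudo journalctl -u corosync -u pacemaker --since "2026-01-01 20:20" --no-pager | tail -n 50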
- Confirm the current cluster properties.
$ sudo pcs property config
Cluster Properties: cib-bootstrap-options
  cluster-infrastructure=corosync
  cluster-name=clustername
  dc-version=2.1.6-6fdc9deea29
  have-watchdog=false
  last-lrm-refresh=1767240361
  no-quorum-policy=stop
  stonith-enabled=true
- Verify fencing resources before restarting services.
$ sudo pcs stonith status
  * fence-dummy-node-01 (stonith:fence_dummy): Stopped
  * fence-dummy-node-02 (stonith:fence_dummy): Stopped
  * fence-dummy-node-03 (stonith:fence_dummy): Stopped
Unhealthy fencing devices can trigger unexpected node reboots during recovery.
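Reviewing how each fence device is configured, and whether the fencer has registered it, helps catch problems before fencing is relied on; a sketch using one of the device names above.

$ sudo pcs stonith config fence-dummy-node-01   # device type, parameters, and host mapping
$ sudo stonith_admin --list-registered          # fence devices currently registered with the fencer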
- Review resource states and placement after the cluster forms.
$ sudo pcs status resources
  * vip (ocf:heartbeat:IPaddr2): Stopped
  * app (systemd:app): Stopped
Stopped resources often indicate a dependency issue or a recorded failure that blocks restart.
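Failure counts and constraints are the usual places to look; a sketch using the vip resource from this example.

$ sudo pcs resource failcount show vip   # per-node failure counts recorded for the resource
$ sudo pcs constraint config             # location, colocation, and ordering constraints in effect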
- Clear stale failure history after underlying issues are fixed.
$ sudo pcs resource cleanup
Cleaned up all resources on all nodes
Cleanup resets failure counters and can immediately retry starts, so run only after the root cause is resolved.
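Cleanup can also be limited to a single resource so that only that resource is retried; a sketch for vip only.

$ sudo pcs resource cleanup vip   # clears failure history for vip without touching other resources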
- Temporarily disable fencing while resources are restarted.
$ sudo pcs property set stonith-enabled=false
Disable stonith-enabled only long enough to recover services, then re-enable it immediately.
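Checking the property value before and after the change confirms the cluster accepted it; a small sketch.

$ sudo pcs property config | grep stonith-enabled   # shows the current stonith-enabled value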
- Re-enable critical resources after cleanup.
$ sudo pcs resource enable vip
$ sudo pcs resource enable app
- Confirm resources return to Started state.
$ sudo pcs status resources
  * vip (ocf:heartbeat:IPaddr2): Started node-01
  * app (systemd:app): Started node-02
- Re-enable fencing and verify the devices start.
$ sudo pcs property set stonith-enabled=true
$ sudo pcs stonith status
  * fence-dummy-node-01 (stonith:fence_dummy): Started node-01
  * fence-dummy-node-02 (stonith:fence_dummy): Started node-02
  * fence-dummy-node-03 (stonith:fence_dummy): Started node-03
- Verify the cluster reports quorum after recovery.
$ sudo pcs status
Cluster name: clustername
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node-02 (version 2.1.6-6fdc9deea29) - partition with quorum
  * Last updated: Thu Jan 1 20:27:09 2026 on node-01
  * Last change: Thu Jan 1 20:27:08 2026 by root via cibadmin on node-01
  * 4 nodes configured
  * 5 resource instances configured

Node List:
  * Node node-04: UNCLEAN (offline)
  * Online: [ node-01 node-02 node-03 ]

Full List of Resources:
  * fence-dummy-node-01 (stonith:fence_dummy): Started node-01
  * fence-dummy-node-02 (stonith:fence_dummy): Started node-02
  * fence-dummy-node-03 (stonith:fence_dummy): Started node-03
  * vip (ocf:heartbeat:IPaddr2): Started node-01
  * app (systemd:app): Started node-02
##### snipped #####
The "partition with quorum" status in the DC line indicates that a majority of cluster votes is present.
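The vote counts behind that status can be confirmed at the Corosync layer as well; a sketch.

$ sudo corosync-quorumtool -s   # expected votes, total votes, and the Quorate flag
$ sudo pcs quorum status        # the same vote information reported through pcs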
- Confirm expected resources are started.
$ sudo pcs status resources
  * vip (ocf:heartbeat:IPaddr2): Started node-01
  * app (systemd:app): Started node-02
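A one-shot snapshot from crm_mon gives the same confirmation without the full pcs status report; a sketch.

$ sudo crm_mon -1   # single snapshot of node and resource state, then exit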
