A failed Pacemaker resource can leave a critical service stopped, migrated, or blocked by failure counters, reducing high-availability protections until the failure state is cleared. Recovering cleanly restores scheduler control so the cluster can resume automatic restarts and failover behavior.

When a start, stop, or monitor operation fails, Pacemaker records the failed action and increments the resource's fail count on the node where the failure occurred. The pcs CLI can query cluster status, list the failed actions, and request a cleanup that resets the recorded failures so the scheduler is free to retry the resource.
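
The per-node fail counts that drive this behavior can also be inspected directly. As a quick example, assuming the resource is named web-service as in the steps below, pcs can report the fail counts recorded for it (exact output wording varies between pcs versions):
    $ sudo pcs resource failcount show web-service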

Clearing failures without fixing the root cause can trigger repeated restart attempts and may cascade into dependent resources through ordering or colocation constraints. Review the resource definition and relevant logs first, then perform a targeted cleanup for the specific resource that failed.
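
Before clearing anything, it can also help to list the ordering and colocation constraints that involve the failed resource, so the impact of a retry on dependent resources is clear. A minimal check, assuming a reasonably current pcs (newer releases also accept pcs constraint config):
    $ sudo pcs constraint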

Steps to recover from a Pacemaker resource failure with PCS:

  1. Open a terminal on a cluster node with sudo privileges.
  2. Review cluster status to identify the failed resource, including the node reporting the failure.
    $ sudo pcs status --full
    Cluster name: clustername
    Cluster Summary:
      * Stack: corosync (Pacemaker is running)
      * Current DC: node-03 (3) (version 2.1.6-6fdc9deea29) - partition with quorum
      * Last updated: Wed Dec 31 09:17:21 2025 on node-01
      * Last change:  Wed Dec 31 09:15:47 2025 by root via cibadmin on node-01
      * 3 nodes configured
      * 2 resource instances configured
    
    Node List:
      * Node node-01 (1): online, feature set 3.17.4
      * Node node-02 (2): online, feature set 3.17.4
      * Node node-03 (3): online, feature set 3.17.4
    
    Full List of Resources:
      * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
      * web-service (systemd:nginx): Started node-02
    
    Migration Summary:
      * Node: node-02 (2):
        * web-service: migration-threshold=1000000 fail-count=1 last-failure='Wed Dec 31 09:16:15 2025'
    
    Failed Resource Actions:
      * web-service_monitor_30000 on node-02 'not running' (7): call=54, status='complete', exitreason='inactive', last-rc-change='Wed Dec 31 09:16:15 2025', queued=0ms, exec=0ms
    
    Tickets:
    
    PCSD Status:
      node-01: Online
      node-02: Online
      node-03: Online
    
    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled

    Use the resource ID from this output (for example web-service) in later commands.
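
    For a quick one-shot re-check while troubleshooting, crm_mon (shipped with Pacemaker) offers a lighter view of the same state than repeating the full pcs report:
    $ sudo crm_mon -1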

  3. Inspect the resource definition for incorrect parameters or operation settings.
    $ sudo pcs resource config web-service
    Resource: web-service (class=systemd type=nginx)
      Operations:
        monitor: web-service-monitor-interval-30s
          interval=30s
        start: web-service-start-interval-0s
          interval=0s timeout=100
        stop: web-service-stop-interval-0s
          interval=0s timeout=100
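
    If the definition looks correct, it can also be worth checking the managed service itself on the node that reported the failure (node-02 here, where the resource wraps a systemd nginx unit); the unit's own status and logs often show why the monitor found it inactive:
    $ sudo systemctl status nginx
    $ sudo journalctl -u nginx --since "10 min ago" --no-pager
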
  4. Review recent Pacemaker logs on the failing node around the failure time.
    $ sudo journalctl -u pacemaker --since "10 min ago" --no-pager | grep -E 'last-failure|fail-count' | head -n 2
    Dec 31 09:16:15 node-02 pacemaker-attrd[1872]:  notice: Setting last-failure-web-service#monitor_30000[node-02] in instance_attributes: (unset) -> 1767172575
    Dec 31 09:16:15 node-02 pacemaker-attrd[1872]:  notice: Setting fail-count-web-service#monitor_30000[node-02] in instance_attributes: (unset) -> 1

    On systems without journald, check /var/log/pacemaker/pacemaker.log or /var/log/messages for the failure reason.
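
    As a sketch, the same fail-count and last-failure attribute updates can be pulled from the file-based log with standard tools (the path may differ by distribution):
    $ sudo grep -E 'fail-count|last-failure' /var/log/pacemaker/pacemaker.log | tail -n 5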

  5. Clear the resource's failure history to reset its fail counts and unblock recovery.
    $ sudo pcs resource cleanup web-service
    Cleaned up web-service on node-02
    Cleaned up web-service on node-03
    Cleaned up web-service on node-01
    Waiting for 1 reply from the controller
    ... got reply (done)

    Running cleanup without a resource name can clear failures for all resources, potentially triggering multiple restarts across the cluster.
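
    The cleanup can also be restricted to the node that reported the failure. The exact syntax depends on the pcs version; on pcs 0.10 and later the node is passed as a node= argument, for example:
    $ sudo pcs resource cleanup web-service node=node-02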

  6. Restart the resource if it remains stopped after cleanup.
    $ sudo pcs resource restart web-service --wait=120
    web-service successfully restarted

    Restarting a failed resource can briefly interrupt service, potentially restarting dependent resources.
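
    After the restart, a quick way to confirm where the resource is now running, without the full status report, is Pacemaker's crm_resource locator:
    $ sudo crm_resource --resource web-service --locate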

  7. Confirm the cluster status shows the resource started with no failed actions.
    $ sudo pcs status --full
    Cluster name: clustername
    Cluster Summary:
      * Stack: corosync (Pacemaker is running)
      * Current DC: node-03 (3) (version 2.1.6-6fdc9deea29) - partition with quorum
      * Last updated: Wed Dec 31 09:17:29 2025 on node-01
      * Last change:  Wed Dec 31 09:17:26 2025 by root via crm_resource on node-01
      * 3 nodes configured
      * 2 resource instances configured
    
    Node List:
      * Node node-01 (1): online, feature set 3.17.4
      * Node node-02 (2): online, feature set 3.17.4
      * Node node-03 (3): online, feature set 3.17.4
    
    Full List of Resources:
      * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
      * web-service (systemd:nginx): Started node-01
    
    Migration Summary:
    
    Tickets:
    
    PCSD Status:
      node-01: Online
      node-02: Online
      node-03: Online
    
    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
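
    As a final check that nothing lingers after recovery, the fail counts can be queried again; on current pcs releases the resource ID can be omitted to cover all resources, and after a successful cleanup this should report no remaining failures:
    $ sudo pcs resource failcount show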