Pacemaker records resource action failures so the cluster can stop retrying a broken service and avoid repeated disruption. After a transient outage or after repairs, stale failure records can keep a resource stopped or prevent it from running on a specific node even though the underlying issue is resolved.
Failure history is stored per node as attributes such as fail-count and last-failure for each operation (start/stop/monitor). These values are evaluated with policies like migration-threshold and failure-timeout to determine whether a resource should be restarted, moved, or left in a failed state.
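These policies are ordinary resource meta attributes and can be adjusted with pcs. The commands below are an illustrative sketch only, using the web-service resource from the example that follows and arbitrary values; pcs resource config requires a recent pcs release (older versions use pcs resource show instead).
# Illustrative values: move the resource after 3 failures, expire failure records after 300s.
$ sudo pcs resource meta web-service migration-threshold=3 failure-timeout=300s
# Show the resource configuration, including its meta attributes.
$ sudo pcs resource config web-service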
Running pcs resource cleanup clears the failure counters and triggers fresh probes, making the resource eligible for recovery under current policy. Cleanup does not fix the root cause, and running it on an unstable resource can cause restart loops or repeated failover, so log review and dependency checks should happen before clearing failures on production clusters.
Steps to clear Pacemaker resource failures:
- Review failed actions for the resource.
$ sudo pcs status --full
Cluster name: clustername
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node-03 (3) (version 2.1.6-6fdc9deea29) - partition with quorum
  * Last updated: Wed Dec 31 09:07:07 2025 on node-01
  * Last change: Wed Dec 31 09:05:48 2025 by root via cibadmin on node-01
  * 3 nodes configured
  * 2 resource instances configured

Node List:
  * Node node-01 (1): online, feature set 3.17.4
  * Node node-02 (2): online, feature set 3.17.4
  * Node node-03 (3): online, feature set 3.17.4

Full List of Resources:
  * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
  * web-service (systemd:nginx): Started node-02

Migration Summary:
  * Node: node-02 (2):
    * web-service: migration-threshold=1000000 fail-count=1 last-failure='Wed Dec 31 09:06:25 2025'

Failed Resource Actions:
  * web-service_monitor_30000 on node-02 'not running' (7): call=28, status='complete', exitreason='inactive', last-rc-change='Wed Dec 31 09:06:25 2025', queued=0ms, exec=0ms

The node name in Failed Resource Actions identifies where the failing operation was recorded.
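The recorded fail-count for a single resource can also be queried directly with pcs, which is quicker than reading the full status output:
# Show recorded fail-counts for web-service on each node.
$ sudo pcs resource failcount show web-service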
- Check recent Pacemaker logs on the node that recorded the failure.
$ sudo journalctl -u pacemaker --since "10 min ago" | grep -E 'not running|fail-count|last-failure' | tail -n 3
Dec 31 09:06:25 node-02 pacemaker-attrd[1872]: notice: Setting fail-count-web-service#monitor_30000[node-02] in instance_attributes: (unset) -> 1
Dec 31 09:07:08 node-02 pacemaker-attrd[1872]: notice: Setting fail-count-web-service#monitor_30000[node-02] in instance_attributes: 1 -> (unset)
Dec 31 09:07:08 node-02 pacemaker-attrd[1872]: notice: Setting last-failure-web-service#monitor_30000[node-02] in instance_attributes: 1767171985 -> (unset)
Some distributions log to /var/log/pacemaker/pacemaker.log or syslog instead of journald.
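On such systems, an equivalent check against the log file might look like the following sketch; adjust the path and pattern to your setup.
# Search the file-based Pacemaker log for recent failure records of the resource.
$ sudo grep -E 'web-service.*(fail-count|not running)' /var/log/pacemaker/pacemaker.log | tail -n 5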
- Resolve the underlying cause of the failed action before clearing failure history.
Common causes include missing mounts, conflicting virtual IPs, stopped systemd units, failing monitor scripts, name resolution problems, and permission issues.
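For the systemd:nginx resource in this example, a minimal set of checks on the failing node could look like this (the unit name nginx follows from the resource definition; the commands are read-only apart from nginx -t, which only validates the configuration):
# Check the state of the systemd unit that Pacemaker manages.
$ sudo systemctl status nginx
# Validate the nginx configuration before letting the cluster restart it.
$ sudo nginx -t
# Review recent service logs for the actual error.
$ sudo journalctl -u nginx --since "15 min ago"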
- Clear the failure history for the resource.
$ sudo pcs resource cleanup web-service
Cleaned up web-service on node-02
Cleaned up web-service on node-03
Cleaned up web-service on node-01
Waiting for 1 reply from the controller
... got reply (done)
Clearing failures can immediately trigger restarts or relocation according to policy, and running cleanup without a resource ID clears failure records cluster-wide.
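To limit the scope, cleanup can be restricted to one resource on one node. The node= form shown below applies to pcs 0.10 and later; older releases use a --node option instead.
# Clear failure records for web-service only on node-02.
$ sudo pcs resource cleanup web-service node=node-02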
- Confirm the failure records are cleared for the resource.
$ sudo pcs status --full
Cluster name: clustername
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node-03 (3) (version 2.1.6-6fdc9deea29) - partition with quorum
  * Last updated: Wed Dec 31 09:07:08 2025 on node-01
  * Last change: Wed Dec 31 09:07:08 2025 by hacluster via crmd on node-02
  * 3 nodes configured
  * 2 resource instances configured

Node List:
  * Node node-01 (1): online, feature set 3.17.4
  * Node node-02 (2): online, feature set 3.17.4
  * Node node-03 (3): online, feature set 3.17.4

Full List of Resources:
  * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
  * web-service (systemd:nginx): Started node-02

Migration Summary:

Tickets:

PCSD Status:
  node-01: Online
  node-02: Online
  node-03: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
The Failed Resource Actions section disappears when no failures are recorded.
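The failcount query from the review step should likewise report no remaining failures:
# Confirm that no fail-counts remain for the resource after cleanup.
$ sudo pcs resource failcount show web-service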
- Monitor cluster status for recurring failures.
$ watch -n 2 sudo pcs status --full
Press Ctrl+C to stop watch.
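crm_mon, which ships with Pacemaker, refreshes on its own and can be used instead of watch; the interval below is only an example.
# Continuously monitor cluster status, refreshing every 2 seconds; press Ctrl+C to exit.
$ sudo crm_mon --interval=2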
