A failed Pacemaker resource can leave a critical service stopped, migrated, or blocked by failure counters, reducing high-availability protections until the failure state is cleared. Recovering cleanly restores scheduler control so the cluster can resume automatic restarts and failover behavior.
When a start, stop, or monitor operation fails, Pacemaker records the failed action and increments the resource's fail count on that node. The pcs CLI can query cluster status, list failed actions, and request a cleanup that resets the recorded failures so the scheduler retries the resource.
Clearing failures without fixing the root cause can trigger repeated restart attempts and may cascade into dependent resources through ordering or colocation constraints. Review the resource definition and relevant logs first, then perform a targeted cleanup for the specific resource that failed.
$ sudo pcs status --full
Cluster name: clustername
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: node-03 (3) (version 2.1.6-6fdc9deea29) - partition with quorum
* Last updated: Wed Dec 31 09:17:21 2025 on node-01
* Last change: Wed Dec 31 09:15:47 2025 by root via cibadmin on node-01
* 3 nodes configured
* 2 resource instances configured
Node List:
* Node node-01 (1): online, feature set 3.17.4
* Node node-02 (2): online, feature set 3.17.4
* Node node-03 (3): online, feature set 3.17.4
Full List of Resources:
* cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
* web-service (systemd:nginx): Started node-02
Migration Summary:
* Node: node-02 (2):
* web-service: migration-threshold=1000000 fail-count=1 last-failure='Wed Dec 31 09:16:15 2025'
Failed Resource Actions:
* web-service_monitor_30000 on node-02 'not running' (7): call=54, status='complete', exitreason='inactive', last-rc-change='Wed Dec 31 09:16:15 2025', queued=0ms, exec=0ms
Tickets:
PCSD Status:
node-01: Online
node-02: Online
node-03: Online
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Use the resource ID from this output (for example web-service) in later commands.
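Beyond the Migration Summary above, pcs can report fail counts for a single resource directly. A minimal sketch, using the web-service resource from the example output (output format varies by pcs version, so it is not shown here):

```shell
# Show per-node fail counts for one resource; omit the resource ID
# to list fail counts for every resource in the cluster.
sudo pcs resource failcount show web-service
```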
$ sudo pcs resource config web-service
Resource: web-service (class=systemd type=nginx)
Operations:
monitor: web-service-monitor-interval-30s
interval=30s
start: web-service-start-interval-0s
interval=0s timeout=100
stop: web-service-stop-interval-0s
interval=0s timeout=100
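The migration-threshold=1000000 shown in the Migration Summary is the default, which effectively keeps the resource on its current node no matter how many failures accumulate. As an illustrative tuning only (the values here are assumptions, not recommendations), the threshold and a failure-timeout can be set as resource meta attributes:

```shell
# Illustrative values: ban the resource from a node after 3 failures
# there, and expire recorded failures automatically after 10 minutes.
sudo pcs resource update web-service meta migration-threshold=3 failure-timeout=600s
```

With a failure-timeout set, old failures age out on their own, which reduces the need for manual cleanup after transient faults.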
$ sudo journalctl -u pacemaker --since "10 min ago" --no-pager | grep -E 'last-failure|fail-count' | head -n 2
Dec 31 09:16:15 node-02 pacemaker-attrd[1872]: notice: Setting last-failure-web-service#monitor_30000[node-02] in instance_attributes: (unset) -> 1767172575
Dec 31 09:16:15 node-02 pacemaker-attrd[1872]: notice: Setting fail-count-web-service#monitor_30000[node-02] in instance_attributes: (unset) -> 1
On systems without journald, check /var/log/pacemaker/pacemaker.log or /var/log/messages for the failure reason.
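Because web-service wraps a systemd unit, the unit's own logs usually explain why the monitor reported 'not running'. Assuming the nginx unit from the example, run these on the node that recorded the failure (node-02 above):

```shell
# Check the wrapped systemd unit directly for the root cause.
sudo systemctl status nginx --no-pager
sudo journalctl -u nginx --since "10 min ago" --no-pager
```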
$ sudo pcs resource cleanup web-service
Cleaned up web-service on node-02
Cleaned up web-service on node-03
Cleaned up web-service on node-01
Waiting for 1 reply from the controller
... got reply (done)
Running cleanup without a resource name can clear failures for all resources, potentially triggering multiple restarts across the cluster.
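Cleanup can also be scoped to a single node. Recent pcs releases accept a node= argument for this (older releases used a --node option), so confirm the form your version supports with `pcs resource cleanup --help`:

```shell
# Clear the failure only where it was recorded (pcs 0.10+ syntax;
# verify against `pcs resource cleanup --help` on your version).
sudo pcs resource cleanup web-service node=node-02
```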
$ sudo pcs resource restart web-service --wait=120
web-service successfully restarted
Restarting a failed resource can briefly interrupt service, potentially restarting dependent resources.
$ sudo pcs status --full
Cluster name: clustername
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: node-03 (3) (version 2.1.6-6fdc9deea29) - partition with quorum
* Last updated: Wed Dec 31 09:17:29 2025 on node-01
* Last change: Wed Dec 31 09:17:26 2025 by root via crm_resource on node-01
* 3 nodes configured
* 2 resource instances configured
Node List:
* Node node-01 (1): online, feature set 3.17.4
* Node node-02 (2): online, feature set 3.17.4
* Node node-03 (3): online, feature set 3.17.4
Full List of Resources:
* cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
* web-service (systemd:nginx): Started node-01
Migration Summary:
Tickets:
PCSD Status:
node-01: Online
node-02: Online
node-03: Online
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled