A failed Pacemaker resource can leave a critical service stopped, migrated, or blocked by failure counters, reducing high-availability protections until the failure state is cleared. Recovering cleanly restores scheduler control so the cluster can resume automatic restarts and failover behavior.
When a start, stop, or monitor operation fails, Pacemaker records the failed action and increments failcounts per node. The pcs CLI queries the cluster status, shows failed actions, and can request cleanup to reset the recorded failures so the scheduler can retry the resource.
Clearing failures without fixing the root cause can trigger repeated restart attempts and may cascade into dependent resources through ordering or colocation constraints. Review the resource definition and relevant logs first, then perform a targeted cleanup for the specific resource that failed.
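The per-node fail counts can also be inspected directly at any point, which helps judge how close a resource is to its migration threshold before anything is cleared; the resource ID web-service below is just an example taken from the cluster used in this guide.
$ sudo pcs resource failcount show web-service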
Steps to recover from a Pacemaker resource failure with PCS:
- Open a terminal on a cluster node with sudo privileges.
- Review cluster status to identify the failed resource, including the node reporting the failure.
$ sudo pcs status --full
Cluster name: clustername
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node-03 (3) (version 2.1.6-6fdc9deea29) - partition with quorum
  * Last updated: Wed Dec 31 09:17:21 2025 on node-01
  * Last change: Wed Dec 31 09:15:47 2025 by root via cibadmin on node-01
  * 3 nodes configured
  * 2 resource instances configured

Node List:
  * Node node-01 (1): online, feature set 3.17.4
  * Node node-02 (2): online, feature set 3.17.4
  * Node node-03 (3): online, feature set 3.17.4

Full List of Resources:
  * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
  * web-service (systemd:nginx): Started node-02

Migration Summary:
  * Node: node-02 (2):
    * web-service: migration-threshold=1000000 fail-count=1 last-failure='Wed Dec 31 09:16:15 2025'

Failed Resource Actions:
  * web-service_monitor_30000 on node-02 'not running' (7): call=54, status='complete', exitreason='inactive', last-rc-change='Wed Dec 31 09:16:15 2025', queued=0ms, exec=0ms

Tickets:

PCSD Status:
  node-01: Online
  node-02: Online
  node-03: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Use the resource ID from this output (for example web-service) in later commands.
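For a shorter view limited to resource state and fail counts, crm_mon can be run in one-shot mode; -1 prints once and exits, -r includes inactive resources, and -f shows fail counts.
$ sudo crm_mon -1 -r -f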
- Inspect the resource definition for incorrect parameters or operation settings.
$ sudo pcs resource config web-service
Resource: web-service (class=systemd type=nginx)
  Operations:
    monitor: web-service-monitor-interval-30s
      interval=30s
    start: web-service-start-interval-0s
      interval=0s timeout=100
    stop: web-service-stop-interval-0s
      interval=0s timeout=100
- Review recent Pacemaker logs on the failing node around the failure time.
$ sudo journalctl -u pacemaker --since "10 min ago" --no-pager | grep -E 'last-failure|fail-count' | head -n 2
Dec 31 09:16:15 node-02 pacemaker-attrd[1872]: notice: Setting last-failure-web-service#monitor_30000[node-02] in instance_attributes: (unset) -> 1767172575
Dec 31 09:16:15 node-02 pacemaker-attrd[1872]: notice: Setting fail-count-web-service#monitor_30000[node-02] in instance_attributes: (unset) -> 1
On systems without journald, check /var/log/pacemaker/pacemaker.log or /var/log/messages for the failure reason.
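Because web-service is backed by a systemd unit (class=systemd type=nginx in the resource config above), it can also help to check the unit itself on the failing node; the error reported there usually lines up with the exitreason Pacemaker recorded.
$ sudo systemctl status nginx --no-pager
$ sudo journalctl -u nginx --since "10 min ago" --no-pager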
- Clear the resource failure history to reset failcounts, unblocking recovery.
$ sudo pcs resource cleanup web-service
Cleaned up web-service on node-02
Cleaned up web-service on node-03
Cleaned up web-service on node-01
Waiting for 1 reply from the controller
... got reply (done)
Running cleanup without a resource name clears failures for all resources, potentially triggering multiple restarts across the cluster.
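A cleanup can also be limited to the node that reported the failure; in recent pcs releases the node is given as node=<name> (older pcs versions use a --node option instead).
$ sudo pcs resource cleanup web-service node=node-02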
- Restart the resource if it remains stopped after cleanup.
$ sudo pcs resource restart web-service --wait=120
web-service successfully restarted
Restarting a failed resource can briefly interrupt service, potentially restarting dependent resources.
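If manual cleanups become routine, the migration-threshold and failure-timeout meta attributes control how many failures Pacemaker tolerates per node and when it expires them on its own; the values below are illustrative only, not recommendations.
$ sudo pcs resource meta web-service migration-threshold=3 failure-timeout=300s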
- Confirm the cluster status shows the resource started with no failed actions.
$ sudo pcs status --full
Cluster name: clustername
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node-03 (3) (version 2.1.6-6fdc9deea29) - partition with quorum
  * Last updated: Wed Dec 31 09:17:29 2025 on node-01
  * Last change: Wed Dec 31 09:17:26 2025 by root via crm_resource on node-01
  * 3 nodes configured
  * 2 resource instances configured

Node List:
  * Node node-01 (1): online, feature set 3.17.4
  * Node node-02 (2): online, feature set 3.17.4
  * Node node-03 (3): online, feature set 3.17.4

Full List of Resources:
  * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
  * web-service (systemd:nginx): Started node-01

Migration Summary:

Tickets:

PCSD Status:
  node-01: Online
  node-02: Online
  node-03: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
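Because cleanup only resets the recorded failures, it is worth watching the cluster for a short while afterwards to confirm the failure does not recur; run without -1, crm_mon keeps refreshing the status view until interrupted.
$ sudo crm_mon -r -f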
