A failed Pacemaker resource can leave a critical service stopped, migrated, or blocked by failure counters, reducing high-availability protections until the failure state is cleared. Recovering cleanly restores scheduler control so the cluster can resume automatic restarts and failover behavior.

When a start, stop, or monitor operation fails, Pacemaker records the failed action and increments the resource's fail count on the node where the failure occurred. The pcs CLI can query cluster status, list the failed actions, and request a cleanup that resets the recorded failures so the scheduler is free to retry the resource.
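
The per-node fail counts that drive this behavior can also be inspected directly. As a quick example, assuming the resource is named web-service as in the steps below, pcs can report the fail counts recorded for it (exact output wording varies between pcs versions):
    $ sudo pcs resource failcount show web-service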

Clearing failures without fixing the root cause can trigger repeated restart attempts and may cascade into dependent resources through ordering or colocation constraints. Review the resource definition and relevant logs first, then perform a targeted cleanup for the specific resource that failed.
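
Before clearing anything, it can also help to list the ordering and colocation constraints that involve the failed resource, so the impact of a retry on dependent resources is clear. A minimal check, assuming a reasonably current pcs (newer releases also accept pcs constraint config):
    $ sudo pcs constraint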

Steps to recover from a Pacemaker resource failure with PCS:

  1. Open a terminal on a cluster node with sudo privileges.
  2. Review cluster status to identify the failed resource, including the node reporting the failure.
    $ sudo pcs status --full
    Cluster name: clustername
    Cluster Summary:
      * Stack: corosync (Pacemaker is running)
      * Current DC: node-03 (3) (version 2.1.6-6fdc9deea29) - partition with quorum
      * Last updated: Wed Dec 31 09:17:21 2025 on node-01
      * Last change:  Wed Dec 31 09:15:47 2025 by root via cibadmin on node-01
      * 3 nodes configured
      * 2 resource instances configured
    
    Node List:
      * Node node-01 (1): online, feature set 3.17.4
      * Node node-02 (2): online, feature set 3.17.4
      * Node node-03 (3): online, feature set 3.17.4
    
    Full List of Resources:
      * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
      * web-service (systemd:nginx): Started node-02
    
    Migration Summary:
      * Node: node-02 (2):
        * web-service: migration-threshold=1000000 fail-count=1 last-failure='Wed Dec 31 09:16:15 2025'
    
    Failed Resource Actions:
      * web-service_monitor_30000 on node-02 'not running' (7): call=54, status='complete', exitreason='inactive', last-rc-change='Wed Dec 31 09:16:15 2025', queued=0ms, exec=0ms
    
    Tickets:
    
    PCSD Status:
      node-01: Online
      node-02: Online
      node-03: Online
    
    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled

    Use the resource ID from this output (for example web-service) in later commands.
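
    For a quick one-shot re-check while troubleshooting, crm_mon (shipped with Pacemaker) offers a lighter view of the same state than repeating the full pcs report:
    $ sudo crm_mon -1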

  3. Inspect the resource definition for incorrect parameters or operation settings.
    $ sudo pcs resource config web-service
    Resource: web-service (class=systemd type=nginx)
      Operations:
        monitor: web-service-monitor-interval-30s
          interval=30s
        start: web-service-start-interval-0s
          interval=0s timeout=100
        stop: web-service-stop-interval-0s
          interval=0s timeout=100
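
    If the definition looks correct, it can also be worth checking the managed service itself on the node that reported the failure (node-02 here, where the resource wraps a systemd nginx unit); the unit's own status and logs often show why the monitor found it inactive:
    $ sudo systemctl status nginx
    $ sudo journalctl -u nginx --since "10 min ago" --no-pager
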
  4. Review recent Pacemaker logs on the failing node around the failure time.
    $ sudo journalctl -u pacemaker --since "10 min ago" --no-pager | grep -E 'last-failure|fail-count' | head -n 2
    Dec 31 09:16:15 node-02 pacemaker-attrd[1872]:  notice: Setting last-failure-web-service#monitor_30000[node-02] in instance_attributes: (unset) -> 1767172575
    Dec 31 09:16:15 node-02 pacemaker-attrd[1872]:  notice: Setting fail-count-web-service#monitor_30000[node-02] in instance_attributes: (unset) -> 1

    On systems without journald, check /var/log/pacemaker/pacemaker.log or /var/log/messages for the failure reason.
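
    As a sketch, the same fail-count and last-failure attribute updates can be pulled from the file-based log with standard tools (the path may differ by distribution):
    $ sudo grep -E 'fail-count|last-failure' /var/log/pacemaker/pacemaker.log | tail -n 5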

  5. Clear the resource's failure history to reset its fail counts and unblock recovery.
    $ sudo pcs resource cleanup web-service
    Cleaned up web-service on node-02
    Cleaned up web-service on node-03
    Cleaned up web-service on node-01
    Waiting for 1 reply from the controller
    ... got reply (done)

    Running cleanup without a resource name can clear failures for all resources, potentially triggering multiple restarts across the cluster.
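
    The cleanup can also be restricted to the node that reported the failure. The exact syntax depends on the pcs version; on pcs 0.10 and later the node is passed as a node= argument, for example:
    $ sudo pcs resource cleanup web-service node=node-02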

  6. Restart the resource if it remains stopped after cleanup.
    $ sudo pcs resource restart web-service --wait=120
    web-service successfully restarted

    Restarting a failed resource can briefly interrupt service, potentially restarting dependent resources.
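
    After the restart, a quick way to confirm where the resource is now running, without the full status report, is Pacemaker's crm_resource locator:
    $ sudo crm_resource --resource web-service --locate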

  7. Confirm the cluster status shows the resource started with no failed actions.
    $ sudo pcs status --full
    Cluster name: clustername
    Cluster Summary:
      * Stack: corosync (Pacemaker is running)
      * Current DC: node-03 (3) (version 2.1.6-6fdc9deea29) - partition with quorum
      * Last updated: Wed Dec 31 09:17:29 2025 on node-01
      * Last change:  Wed Dec 31 09:17:26 2025 by root via crm_resource on node-01
      * 3 nodes configured
      * 2 resource instances configured
    
    Node List:
      * Node node-01 (1): online, feature set 3.17.4
      * Node node-02 (2): online, feature set 3.17.4
      * Node node-03 (3): online, feature set 3.17.4
    
    Full List of Resources:
      * cluster_ip (ocf:heartbeat:IPaddr2): Started node-02
      * web-service (systemd:nginx): Started node-01
    
    Migration Summary:
    
    Tickets:
    
    PCSD Status:
      node-01: Online
      node-02: Online
      node-03: Online
    
    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
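
    As a final check that nothing lingers after recovery, the fail counts can be queried again; on current pcs releases the resource ID can be omitted to cover all resources, and after a successful cleanup this should report no remaining failures:
    $ sudo pcs resource failcount show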