Active-active high availability runs the same service on multiple nodes so capacity scales and a single node failure becomes reduced capacity instead of a full outage. It reduces failover impact and makes maintenance less disruptive because traffic can stay on surviving instances. The trade-off is that concurrency bugs and split-brain are extremely efficient at ruining a day.
In a Pacemaker cluster managed through pcs, active-active is usually modeled as a cloned resource so the scheduler can start and monitor parallel instances across nodes. For replicated state, a promotable resource adds roles (Promoted/Unpromoted) so the cluster can enforce writer/reader behavior while still handling failover. Constraints, resource meta attributes, and monitoring operations determine where instances run and how the cluster reacts to faults.
Traffic distribution, client affinity, and data consistency are not solved by Pacemaker alone. A single floating VIP is inherently active-passive, so active-active requires multiple front-end endpoints (node IPs behind a load balancer, anycast, or a DNS strategy), and any shared state (files, databases, caches) must be protected with shared storage, replication, or an application-native clustering model. Cluster safety features such as quorum and STONITH fencing are critical for preventing multiple writers after a partition.
Related: How to create a Pacemaker cluster
Active-active high availability checklist for Pacemaker:
- Confirm the cluster is healthy with quorum before changing resource behavior.
$ pcs status --full
Cluster name: clustername
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node-01 (1) (version 2.1.6-6fdc9deea29) - partition with quorum
  * Last updated: Wed Dec 31 23:17:25 2025 on node-01
  * Last change:  Wed Dec 31 12:07:51 2025 by root via cibadmin on node-01
  * 3 nodes configured
  * 8 resource instances configured
##### snipped #####
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
Active-active changes without quorum and fencing can create split-brain and corrupt shared state.
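As a quick extra check, quorum and fencing can be inspected directly before making changes; device and node names will differ per cluster.
$ pcs quorum status
$ pcs stonith status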
- List services that are safe to run on multiple nodes at the same time.
Stateless frontends (HTTP APIs, reverse proxies, queue consumers) are typical active-active candidates.
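One way to build this inventory is to dump what the cluster already manages and review each resource against the criteria above (syntax for recent pcs releases; older releases use pcs resource show --full).
$ pcs resource config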
- Map each candidate service to its state requirements (storage, sessions, locks, leader election) before enabling parallel starts.
Multiple instances writing to the same data without a concurrency model is a common cause of divergent state and corruption.
- Plan how client traffic will be distributed across nodes.
Health checks should represent real readiness, and affinity should be planned when session state cannot be shared.
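A minimal readiness sketch, assuming each node exposes a hypothetical /healthz endpoint over HTTP; substitute whatever URL the load balancer will actually poll.
$ for node in node-01 node-02 node-03; do curl -fsS -o /dev/null -w "$node: %{http_code}\n" "http://$node/healthz"; done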
- Choose between cloned resources and promotable resources for each service.
Cloned resources fit parallel stateless instances, while promotable resources fit replicated services that still require controlled writer roles.
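A sketch of both shapes, assuming an Apache front end and a DRBD resource named appdata; agents, parameters, and names are illustrative and should be adjusted to the actual service. The Promoted/Unpromoted role names assume Pacemaker 2.1 or later.
$ pcs resource create web-frontend ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf op monitor interval=30s clone
$ pcs resource create app-data ocf:linbit:drbd drbd_resource=appdata op monitor interval=29s role=Promoted monitor interval=31s role=Unpromoted promotable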
- Define clone limits and scheduler behavior for parallel services.
Common guardrails include clone-node-max=1 to prevent double-starts on one node and clone-max to cap total instances.
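For example, limits can be applied as meta attributes on an existing clone; web-frontend-clone is the default clone id pcs would give the sketch above and is an assumption here.
$ pcs resource meta web-frontend-clone clone-max=2 clone-node-max=1
An existing primitive can also be cloned with the same limits in one step using pcs resource clone.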
- Tune monitoring and recovery defaults so failures are handled consistently.
Monitor intervals and timeouts should match application startup time and load balancer failover timing to reduce flapping.
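A sketch of cluster-wide defaults plus a per-resource monitor override; the update keyword assumes a recent pcs release, and older releases accept the options directly after defaults. migration-threshold moves a repeatedly failing instance off a node instead of restarting it forever.
$ pcs resource op defaults update timeout=60s
$ pcs resource defaults update migration-threshold=3 failure-timeout=300s
$ pcs resource update web-frontend op monitor interval=10s timeout=30s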
- Put a node into standby to simulate a controlled removal during maintenance.
$ pcs node standby node-02
Return the node after testing with pcs node unstandby node-02.
- Confirm resources relocate away from the standby node.
$ pcs status --full
Cluster name: clustername
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: node-01 (1) (version 2.1.6-6fdc9deea29) - partition with quorum
##### snipped #####
Node List:
  * Node node-01 (1): online, feature set 3.17.4
  * Node node-02 (2): standby, feature set 3.17.4
  * Node node-03 (3): online, feature set 3.17.4
##### snipped #####
