Active-active services need a reliable way to steer clients to any healthy node so every instance can carry load without exposing node-level failures to users.
In a pcs-managed cluster, Pacemaker and Corosync keep service instances started, monitored, and recovered on each node, while a separate distribution layer decides which node receives new connections. That distribution layer is usually a load balancer, DNS (round-robin or weighted), or anycast routing, and it should make decisions based on health signals that reflect real readiness.
Distribution design depends on protocol behavior, session state, and operational constraints such as DNS caching, TLS termination, and acceptable failover time. Health checks must be precise enough to avoid blackholing traffic and must support planned draining so nodes can be removed from rotation before maintenance actions change cluster state.
Active-active traffic distribution checklist for PCS services:
- List every client entrypoint and protocol that must reach the active-active service.
Protocol details (HTTPS, TCP, UDP, gRPC, WebSocket) determine whether L4 or L7 distribution features are available.
- Choose a distribution method that fits the protocol and client behavior.
| Method | Typical fit | Practical failover notes |
|---|---|---|
| Load balancer | Most HTTP(S) and TCP/UDP services | Fast failover when health checks and timeouts are tuned correctly. |
| DNS round-robin/weighted | Simple multi-endpoint services | Failover depends on resolver and client caching, not only TTL. |
| Anycast | Large-scale TCP/UDP services | Fast redirection, but requires routing control and careful operations. |

- Document how the distribution layer avoids becoming a single point of failure.
Examples include redundant load balancers, multi-resolver DNS setups, or multiple anycast announcers with well-defined withdrawal behavior.
- Define the stable client-facing name or address used for the service entrypoint.
Examples include a single FQDN for a load balancer VIP, multiple A/AAAA records for DNS distribution, or an anycast prefix announced from multiple sites.
- Decide where TLS is terminated and which mechanism preserves client identity to the service nodes.
Common patterns include end-to-end TLS with per-node certificates, TLS termination at the load balancer with X-Forwarded-For or Forwarded headers, or the PROXY protocol for L4 pass-through.
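As a rough illustration, a backend can recover the original client address from the X-Forwarded-For header while trusting it only when the connection arrives from a known proxy. The sketch below uses hypothetical proxy addresses and assumes the balancer appends the client IP to that header.

```python
# Minimal sketch: recover the original client IP behind an L7 load balancer.
# Assumes the balancer appends the client address to X-Forwarded-For and that
# only trusted proxies sit in front of the node; the header name and the
# trusted proxy list below are illustrative.

TRUSTED_PROXIES = {"192.0.2.10", "192.0.2.11"}  # hypothetical LB addresses

def client_ip(headers: dict, peer_addr: str) -> str:
    """Return the client IP, trusting X-Forwarded-For only from known proxies."""
    if peer_addr not in TRUSTED_PROXIES:
        return peer_addr  # direct connection, header cannot be trusted
    forwarded = headers.get("X-Forwarded-For", "")
    # Left-most entry is the original client; later hops are proxies.
    first = forwarded.split(",")[0].strip()
    return first or peer_addr

print(client_ip({"X-Forwarded-For": "203.0.113.7, 192.0.2.10"}, "192.0.2.10"))
```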
- Define health checks that validate service readiness on every node.
Example readiness check: HTTP request to /ready, success on 200 OK, 2s timeout, 5s interval, unhealthy after 2 consecutive failures.
A shallow port check can mark a node as healthy while the application is returning errors, which can blackhole traffic during partial outages.
Use checks that fail quickly when dependencies are unavailable.
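A minimal readiness endpoint sketch using only the Python standard library; the dependency address, port, and timeout are placeholders for whatever the service genuinely needs before it can serve traffic.

```python
# Sketch of a /ready endpoint that checks a real dependency and fails fast.
# The dependency address and port below are illustrative.
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

DEPENDENCY = ("127.0.0.1", 5432)  # hypothetical database the service needs

def dependency_ok(timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection(DEPENDENCY, timeout=timeout):
            return True
    except OSError:
        return False

class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ready" and dependency_ok():
            self.send_response(200)
        else:
            self.send_response(503)  # fail fast so the balancer pulls the node
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ReadyHandler).serve_forever()
```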
- Select health check intervals and failure thresholds that avoid flapping during transient dependency issues.
Match thresholds to startup time, warmup behavior, and dependency retry patterns so nodes re-enter rotation only when stable.
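The hysteresis idea can be sketched as a small state machine: a node leaves rotation after consecutive failures and re-enters only after consecutive successes. The thresholds below are illustrative, not recommendations.

```python
# Sketch of threshold-based health state with hysteresis, so a single slow
# dependency retry does not flap the node in and out of rotation.

class HealthState:
    def __init__(self, fall: int = 2, rise: int = 3):
        self.fall, self.rise = fall, rise
        self.failures = 0
        self.successes = 0
        self.healthy = True

    def observe(self, check_passed: bool) -> bool:
        if check_passed:
            self.successes += 1
            self.failures = 0
            if not self.healthy and self.successes >= self.rise:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.healthy and self.failures >= self.fall:
                self.healthy = False
        return self.healthy

state = HealthState()
for result in [True, False, True, False, False, True, True, True]:
    print(state.observe(result))
```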
- Decide on session persistence and client affinity requirements.
Stateless services often avoid affinity, while stateful sessions may require stickiness via cookie, source IP, or consistent hashing.
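For source-based affinity, a consistent hash ring keeps most clients pinned to the same node even when a node joins or leaves; the sketch below uses hypothetical node names and an arbitrary replica count.

```python
# Sketch of client affinity via consistent hashing: removing one node remaps
# only that node's clients instead of reshuffling everyone.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, replicas: int = 100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(replicas):
                h = int(hashlib.md5(f"{node}:{i}".encode()).hexdigest(), 16)
                self.ring.append((h, node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    def node_for(self, client_key: str) -> str:
        h = int(hashlib.md5(client_key.encode()).hexdigest(), 16)
        idx = bisect.bisect(self.keys, h) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node1", "node2", "node3"])
print(ring.node_for("203.0.113.7"))  # the same client IP always maps to the same node
```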
- Document where session state lives and how consistency is maintained across nodes.
Common patterns include a shared database, external session store, distributed cache, or replication that tolerates requests landing on different nodes.
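A minimal sketch of externalized session state, assuming Redis (via the redis-py package) as the shared store; the hostname, key naming, and TTL are illustrative, and any comparable store serves the same purpose.

```python
# Sketch: keep sessions outside process memory so requests can land on any node.
# Requires the redis-py package; the host and TTL below are placeholders.
import json
import redis

store = redis.Redis(host="session-store.example.internal", port=6379)

def save_session(session_id: str, data: dict, ttl_seconds: int = 1800) -> None:
    store.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```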
- Align DNS TTLs and drain times with maintenance windows.
DNS TTL is not a hard guarantee; some clients and resolvers cache records longer and can keep connecting to a drained node.
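A quick way to see what a resolver actually reports is to query the service name and print the remaining TTL; this sketch assumes the dnspython package and a placeholder hostname, and the published TTL only caps well-behaved caches.

```python
# Sketch: inspect the TTL the local resolver reports for the service name.
# Requires the dnspython package; the hostname is illustrative.
import dns.resolver

answer = dns.resolver.resolve("service.example.com", "A")
for record in answer:
    print(record.address, "TTL:", answer.rrset.ttl)
```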
- Document how to remove nodes from rotation before planned work.
Preferred drain order: distribution layer drain → connection drain time → cluster standby or maintenance.
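A sketch of that order as an operator script, assuming some load balancer drain hook and a pcs version that supports `pcs node standby`; the node name and drain time are placeholders.

```python
# Sketch of the drain order: pull the node at the distribution layer, wait out
# connection draining, then put it in cluster standby. drain_at_load_balancer()
# is a placeholder for whatever API or CLI the balancer provides.
import subprocess
import time

NODE = "node2"        # hypothetical cluster node name
DRAIN_SECONDS = 120   # illustrative connection drain window

def drain_at_load_balancer(node: str) -> None:
    # Placeholder: mark the backend as draining via the balancer's own interface.
    print(f"Marking {node} as draining at the distribution layer")

drain_at_load_balancer(NODE)
time.sleep(DRAIN_SECONDS)  # let in-flight connections finish
subprocess.run(["pcs", "node", "standby", NODE], check=True)
```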
- Define the drain behavior for long-lived connections and streaming protocols.
Examples include graceful connection draining, max-connection-age limits, or readiness switching to non-ready while allowing established sessions to complete.
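One readiness-switching approach, sketched with a signal handler that flips the node to not-ready so the distribution layer stops sending new connections while established sessions finish; the signal choice and status codes are illustrative.

```python
# Sketch of readiness switching for graceful drain: after the signal arrives,
# the /ready check starts failing while existing sessions are allowed to complete.
import signal

draining = False

def start_drain(signum, frame):
    global draining
    draining = True  # from this point on, /ready reports not-ready

signal.signal(signal.SIGTERM, start_drain)

def ready_status() -> int:
    """Status code the /ready endpoint should return."""
    return 503 if draining else 200
```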
- Record the expected failover objective and the signals that prove it is being met.
Signals typically include per-node connection counts, error rate, latency, and distribution skew during node loss and recovery.
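Distribution skew can be reduced to a simple ratio of the busiest node against an even share, which is enough to spot traffic failing to rebalance after node loss; the sample counts below are made up.

```python
# Sketch of a distribution-skew signal from per-node connection counts:
# 1.0 means perfectly even, higher values mean one node is carrying extra load.

def distribution_skew(connections_per_node: dict) -> float:
    total = sum(connections_per_node.values())
    if total == 0:
        return 0.0
    even_share = total / len(connections_per_node)
    return max(connections_per_node.values()) / even_share

print(distribution_skew({"node1": 480, "node2": 510, "node3": 505}))
print(distribution_skew({"node1": 0, "node2": 760, "node3": 735}))  # node1 lost
```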
- Schedule a failure drill that removes one node from rotation and confirms client traffic continues on the remaining nodes.
Include both planned drain and unplanned failure scenarios to validate health checks, timeouts, and operational runbooks.
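A client-side measurement sketch for the drill: poll the entrypoint while a node is removed and record error rate and latency against the failover objective. The URL, duration, and interval are placeholders.

```python
# Sketch: observe the service from the client side during a failure drill and
# report error rate and average latency over the observation window.
import time
import urllib.error
import urllib.request

URL = "https://service.example.com/ready"  # placeholder entrypoint
errors = 0
latencies = []

end = time.time() + 300  # observe for 5 minutes
while time.time() < end:
    start = time.time()
    try:
        urllib.request.urlopen(URL, timeout=2)
        latencies.append(time.time() - start)
    except (urllib.error.URLError, OSError):
        errors += 1
    time.sleep(1)

total = errors + len(latencies)
print(f"error rate: {errors / total:.2%}, "
      f"avg latency: {sum(latencies) / max(len(latencies), 1):.3f}s")
```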
