Active-active services need a reliable way to steer clients to any healthy node so every instance can carry load without exposing node-level failures to users.
In a pcs-managed cluster, Pacemaker and Corosync keep service instances started, monitored, and recovered on each node, while a separate distribution layer decides which node receives new connections. That distribution layer is usually a load balancer, DNS (round-robin or weighted), or anycast routing, and it should make decisions based on health signals that reflect real readiness.
Distribution design depends on protocol behavior, session state, and operational constraints such as DNS caching, TLS termination, and acceptable failover time. Health checks must be precise enough to avoid blackholing traffic and must support planned draining so nodes can be removed from rotation before maintenance actions change cluster state.
Protocol details (HTTPS, TCP, UDP, gRPC, WebSocket) determine whether L4 or L7 distribution features are available.
| Method | Typical fit | Practical failover notes |
|---|---|---|
| Load balancer | Most HTTP(S) and TCP/UDP services | Fast failover when health checks and timeouts are tuned correctly. |
| DNS round-robin/weighted | Simple multi-endpoint services | Failover depends on resolver and client caching, not only TTL. |
| Anycast | Large-scale TCP/UDP services | Fast redirection, but requires routing control and careful operations. |
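Whatever the method, the selection step reduces to "choose among currently healthy endpoints, weighted by capacity." A minimal Python sketch of that logic; the addresses, weights, and the `healthy` flags (which a health-check layer would populate) are all illustrative:

```python
import random

# Hypothetical endpoint table: weight models relative capacity,
# healthy would be maintained by the health-check layer.
ENDPOINTS = [
    {"addr": "10.0.0.1", "weight": 3, "healthy": True},
    {"addr": "10.0.0.2", "weight": 1, "healthy": True},
    {"addr": "10.0.0.3", "weight": 2, "healthy": False},
]

def pick_endpoint(endpoints):
    """Weighted random choice restricted to healthy endpoints."""
    healthy = [e for e in endpoints if e["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy endpoints")
    weights = [e["weight"] for e in healthy]
    return random.choices(healthy, weights=weights, k=1)[0]
```

The same shape applies whether the picker lives in a load balancer, a DNS response builder, or routing policy: unhealthy nodes are filtered out before weighting, never weighted down to a small nonzero share.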
The distribution layer itself must not become a single point of failure; examples include redundant load balancers, multi-resolver DNS setups, and multiple anycast announcers with well-defined withdrawal behavior.
The client-facing endpoint varies by method: a single FQDN for a load balancer VIP, multiple A/AAAA records for DNS distribution, or an anycast prefix announced from multiple sites.
Common patterns include end-to-end TLS with per-node certificates, TLS termination at the load balancer with X-Forwarded-For or Forwarded headers, or the PROXY protocol for L4 pass-through.
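When TLS terminates at the load balancer, backends see the balancer's address as the TCP peer and must recover the original client from forwarded headers, trusting those headers only when the connection actually comes from a known proxy. A sketch of that check; the header handling is standard `X-Forwarded-For` semantics, but the addresses are illustrative:

```python
def client_ip(headers, peer_ip, trusted_proxies):
    """Return the original client IP when the request arrived via a
    trusted L7 proxy that sets X-Forwarded-For; otherwise the peer IP.

    headers: dict of request headers
    peer_ip: address of the直接 TCP peer as seen by the backend
    trusted_proxies: set of proxy addresses allowed to assert headers
    """
    xff = headers.get("X-Forwarded-For")
    if xff and peer_ip in trusted_proxies:
        # Leftmost entry is the original client; each proxy appends its peer.
        return xff.split(",")[0].strip()
    # Untrusted peers cannot spoof a client address via headers.
    return peer_ip
```

Ignoring the header from untrusted peers is the important part: a client connecting directly could otherwise forge any source address.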
- Protocol: HTTP
- Path: `/ready`
- Success: 200 OK
- Timeout: 2s
- Interval: 5s
- Unhealthy: 2 failures
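A check like this could be backed by a small readiness endpoint that returns 503 whenever a dependency is down, so the node leaves rotation instead of serving errors. A minimal sketch using the standard library; the dependency names and the flag-based wiring are illustrative:

```python
import http.server
import threading

# Hypothetical dependency flags that deeper checks would keep up to date.
DEPENDENCIES = {"database": True, "cache": True}

class ReadyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":
            self.send_response(404)
            self.end_headers()
            return
        if all(DEPENDENCIES.values()):
            self.send_response(200)   # node may receive traffic
        else:
            self.send_response(503)   # pulled from rotation
        self.end_headers()

    def log_message(self, *args):     # keep the example quiet
        pass

def serve(port=0):
    """Start the readiness server on a background thread; return its port."""
    srv = http.server.HTTPServer(("127.0.0.1", port), ReadyHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv.server_address[1]
```

In production the flags would be set by real dependency probes with their own timeouts, so the endpoint answers instantly from cached state rather than fanning out on every check.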
A shallow port check can mark a node as healthy while the application is returning errors, which can blackhole traffic during partial outages.
Use checks that fail quickly when dependencies are unavailable.
Match thresholds to startup time, warmup behavior, and dependency retry patterns so nodes re-enter rotation only when stable.
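Threshold behavior is easiest to reason about as a small state machine: a node leaves rotation after a run of consecutive check failures and re-enters only after a longer run of successes. A sketch with illustrative fall/rise values:

```python
class HealthGate:
    """Track consecutive health-check results for one node.

    The node goes unhealthy after `fall` consecutive failures and
    returns to healthy only after `rise` consecutive successes, which
    keeps a flapping node out of rotation until it is stable.
    """

    def __init__(self, fall=2, rise=3):
        self.fall, self.rise = fall, rise
        self.failures = 0
        self.successes = 0
        self.healthy = True

    def record(self, ok):
        """Record one check result; return current health state."""
        if ok:
            self.successes += 1
            self.failures = 0
            if not self.healthy and self.successes >= self.rise:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.healthy and self.failures >= self.fall:
                self.healthy = False
        return self.healthy
```

Making `rise` larger than `fall` is a common asymmetry: eject quickly, readmit cautiously.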
Stateless services often avoid affinity, while stateful sessions may require stickiness via cookie, source IP, or consistent hashing.
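Consistent hashing is attractive for stickiness because a membership change remaps only the sessions that were on the departed node. A minimal ring sketch; the node names and virtual-node count are illustrative:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring for session affinity."""

    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` points on the ring to even out the spread.
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key):
        """Map a key to the first node clockwise from its hash."""
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Removing a node deletes only that node's points, so keys that mapped to surviving nodes keep their assignment, unlike modulo hashing, which reshuffles almost everything.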
Common patterns include a shared database, external session store, distributed cache, or replication that tolerates requests landing on different nodes.
DNS TTL is not a hard guarantee; some clients and resolvers cache records longer and can keep connecting to a drained node.
Preferred drain order: distribution layer drain → connection drain time → cluster standby or maintenance.
Examples include graceful connection draining, max-connection-age limits, or readiness switching to non-ready while allowing established sessions to complete.
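The readiness-switching pattern can be sketched as a small per-node controller: starting a drain flips readiness immediately (so health checks pull the node from rotation), refuses new sessions, and reports completion only once established sessions have finished. All names here are illustrative:

```python
class DrainController:
    """Coordinate graceful drain for one node."""

    def __init__(self):
        self.draining = False
        self.active = 0          # established sessions still in flight

    def is_ready(self):
        # Readiness checks call this: a draining node is non-ready at once.
        return not self.draining

    def on_connect(self):
        """Admit a new session unless the node is draining."""
        if self.draining:
            return False
        self.active += 1
        return True

    def on_disconnect(self):
        self.active -= 1

    def begin_drain(self):
        self.draining = True

    def drain_complete(self):
        # Only then is it safe to put the node in cluster standby/maintenance.
        return self.draining and self.active == 0
```

This preserves the drain order above: the distribution layer stops sending traffic first, existing sessions complete, and cluster-level actions happen last.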
Signals typically include per-node connection counts, error rate, latency, and distribution skew during node loss and recovery.
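Distribution skew is easy to reduce to one alertable number: the busiest node's connection count relative to the mean, where 1.0 is a perfectly even spread. A sketch with hypothetical counts:

```python
def distribution_skew(conn_counts):
    """Return max-node connection count divided by the mean.

    conn_counts: mapping of node name -> current connection count.
    1.0 means perfectly even; values well above 1.0 indicate skew
    worth alerting on during node loss and recovery.
    """
    counts = list(conn_counts.values())
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 1.0   # idle cluster: no skew to report
    return max(counts) / mean
```

Tracking this ratio through a failover shows both the spike when a node drops out and whether load rebalances after it returns.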
Include both planned drain and unplanned failure scenarios to validate health checks, timeouts, and operational runbooks.