RabbitMQ outages tend to show up as stuck jobs, stalled notifications, and microservices waiting for a message that never arrives. A Pacemaker-managed floating IP keeps a single broker endpoint reachable during node failures, limiting downtime to the time it takes to relocate the VIP and start RabbitMQ.

The pcs CLI configures Pacemaker resources for an IPaddr2 virtual IP and the rabbitmq-server.service systemd unit. A resource group applies ordering and colocation so the VIP comes up first, the broker follows, and both move together during failover.

This pattern is active/passive: client connections reset during failover, so applications need reconnect logic and reasonable timeouts. Message safety depends on durable queues, persistent messages, and broker state being available on the failover node (for example via a replicated /var/lib/rabbitmq data directory); without that state, the broker starts cleanly on the new node but with empty queues. Plan a maintenance window for the cutover and validate the move behavior with a controlled test before production traffic depends on the VIP.
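
What reconnect logic and message safety mean in practice is easier to show than to describe. The following sketch uses the Python pika client against the floating IP created in step 4 below; the queue name jobs, the heartbeat, and the retry interval are illustrative assumptions rather than values required by the cluster.

    import time

    import pika
    from pika.exceptions import AMQPConnectionError

    VIP = "192.0.2.64"  # floating IP created in step 4

    # Heartbeats and timeouts let the client notice a dead broker quickly
    # instead of waiting on TCP defaults.
    params = pika.ConnectionParameters(host=VIP, heartbeat=30,
                                       blocked_connection_timeout=60)

    while True:
        try:
            connection = pika.BlockingConnection(params)
            channel = connection.channel()
            # A durable queue plus persistent messages survive a broker
            # restart, provided /var/lib/rabbitmq is available on the
            # failover node.
            channel.queue_declare(queue="jobs", durable=True)
            channel.basic_publish(
                exchange="",
                routing_key="jobs",
                body=b"example payload",
                properties=pika.BasicProperties(delivery_mode=2),  # persistent
            )
            connection.close()
            break
        except AMQPConnectionError:
            # The VIP is relocating or the broker is restarting: back off
            # and retry against the same endpoint.
            time.sleep(5)

A consumer follows the same pattern: wrap the connect-and-consume loop in the same retry handler so a failover looks like a brief pause rather than a crash.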

Steps to set up RabbitMQ high availability with PCS:

  1. Confirm the cluster reports quorum.
    $ sudo pcs status
    Cluster name: clustername
    Cluster Summary:
      * Stack: corosync (Pacemaker is running)
      * Current DC: node-01 (version 2.1.6-6fdc9deea29) - partition with quorum
      * 3 nodes configured
      * 0 resource instances configured
    ##### snipped #####
  2. Confirm the RabbitMQ systemd unit exists on every node.
    $ systemctl list-unit-files --type=service | grep -E '^rabbitmq-server\.service'
    rabbitmq-server.service                      enabled         enabled

The unit name, minus the .service suffix, must match the name used in the systemd:rabbitmq-server resource created in step 5.

  3. Disable automatic startup of RabbitMQ outside Pacemaker on all cluster nodes.
    $ sudo systemctl disable --now rabbitmq-server
    Synchronizing state of rabbitmq-server.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
    Executing: /usr/lib/systemd/systemd-sysv-install disable rabbitmq-server
    Removed "/etc/systemd/system/multi-user.target.wants/rabbitmq-server.service".
    ##### snipped #####

    Existing broker connections drop when the service stops.

  4. Create a floating IP resource for the broker endpoint.
    $ sudo pcs resource create rabbitmq_ip ocf:heartbeat:IPaddr2 ip=192.0.2.64 cidr_netmask=24 op monitor interval=30s

    Replace the ip= and cidr_netmask= values with an unused address on the client subnet.

  5. Create the RabbitMQ systemd resource.
    $ sudo pcs resource create rabbitmq_service systemd:rabbitmq-server op start timeout=180s op stop timeout=120s op monitor interval=30s
  6. Group the VIP with the RabbitMQ resource into a single failover unit.
    $ sudo pcs resource group add rabbitmq-stack rabbitmq_ip rabbitmq_service
  7. Verify the resource group placement.
    $ sudo pcs status resources
      * Resource Group: rabbitmq-stack:
        * rabbitmq_ip	(ocf:heartbeat:IPaddr2):	 Started node-01
        * rabbitmq_service	(systemd:rabbitmq-server):	 Started node-01
  8. Verify the floating IP is assigned on the node running the group.
    $ ip -4 address show | grep -F '192.0.2.64'
        inet 192.0.2.64/24 brd 192.0.2.255 scope global secondary eth0
  9. Move the resource group to another node to test failover.
    $ sudo pcs resource move rabbitmq-stack node-02

Clients connected to the VIP disconnect during the move; the probe sketch after these steps shows the outage window from a client's perspective.

  10. Verify the resource group is started on the target node.
    $ sudo pcs status resources
      * Resource Group: rabbitmq-stack:
        * rabbitmq_ip	(ocf:heartbeat:IPaddr2):	 Started node-02
        * rabbitmq_service	(systemd:rabbitmq-server):	 Started node-02
  11. Clear the temporary move constraint to finish the test.
    $ sudo pcs resource clear rabbitmq-stack
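
To watch the step 9 failover from a client's perspective, a small probe can poll the broker port on the VIP while the group moves. This is a minimal sketch assuming the VIP from step 4 and the default AMQP port 5672; it checks TCP reachability only, not broker health.

    import socket
    import time

    VIP = "192.0.2.64"   # floating IP created in step 4
    AMQP_PORT = 5672     # default AMQP listener

    # Attempt a TCP connection once per second and print state changes,
    # so the outage window during "pcs resource move" is visible.
    last_state = None
    while True:
        try:
            with socket.create_connection((VIP, AMQP_PORT), timeout=1):
                state = "reachable"
        except OSError:
            state = "unreachable"
        if state != last_state:
            print(time.strftime("%H:%M:%S"), state)
            last_state = state
        time.sleep(1)

Run the probe from a client machine before issuing the move; the gap between the unreachable and reachable transitions approximates the client-visible downtime.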