Monitoring GlusterFS health helps keep distributed volumes available by catching early warning signs such as disconnected peers, offline bricks, or a growing heal backlog before client workloads start timing out or returning I/O errors.

A GlusterFS cluster forms a trusted pool of peers and serves data from brick directories grouped into volumes. Health signals are exposed through the gluster CLI by checking peer connectivity, brick process state, and background activity such as self-heal and rebalance.

Checks differ by volume type and features in use: replica and disperse volumes depend heavily on heal and split-brain state, while distributed layouts focus on brick availability and capacity. Treat persistent non-zero heal entries, any split-brain listings, repeated errors in /var/log/glusterfs, and unhealthy geo-replication sessions as incidents rather than “noise”.

GlusterFS monitoring checklist:

  1. Check peer connectivity across the trusted pool.
    $ sudo gluster peer status
    Number of Peers: 2
    
    Hostname: node2
    Uuid: 6770f88c-9ec5-4cf8-b9f5-658fa17b6bdc
    State: Peer in Cluster (Connected)
    
    Hostname: node3
    Uuid: 5a3c65f3-1b4d-4d6e-93d4-4c24f0b6b5bf
    State: Peer in Cluster (Connected)

    Peer in Cluster (Connected) indicates the peer is reachable and participating in the cluster.
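
    As a compact cross-check, gluster pool list prints one line per pool member, including the local node (sample output; the localhost UUID here is illustrative):

    $ sudo gluster pool list
    UUID                                    Hostname        State
    6770f88c-9ec5-4cf8-b9f5-658fa17b6bdc    node2           Connected
    5a3c65f3-1b4d-4d6e-93d4-4c24f0b6b5bf    node3           Connected
    9b5f4c21-7a3e-4f0d-8c2a-0d1e2f3a4b5c    localhost       Connected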

  2. Review volume status for brick health.
    $ sudo gluster volume status volume1
    Status of volume: volume1
    Gluster process                             TCP Port  RDMA Port  Online  Pid
    ------------------------------------------------------------------------------
    Brick node1:/srv/gluster/brick1             49152     0          Y       2143
    Brick node2:/srv/gluster/brick1             49152     0          Y       2311
    Self-heal Daemon on node1                   N/A       N/A        Y       2202
    Self-heal Daemon on node2                   N/A       N/A        Y       2370

    Replace volume1 with the target volume name, and treat any Online value of N as a service-impacting fault.
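
    Omitting the volume name reports every volume in the pool at once, which suits a scheduled sweep:

    $ sudo gluster volume status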

  3. Check brick filesystems for free space and inode usage.
    $ df -h /srv/gluster/brick1
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdb1       1.8T  1.1T  640G  64% /srv/gluster/brick1
    
    $ df -i /srv/gluster/brick1
    Filesystem       Inodes   IUsed    IFree IUse% Mounted on
    /dev/sdb1     122093568 512340 121581228    1% /srv/gluster/brick1

    A brick that reaches 100% space or inode usage can trigger client write failures and may block heal or rebalance progress.
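
    The same capacity signals are available from the gluster CLI itself: volume status detail reports free disk space and inode counts per brick, avoiding shell access to each node.

    $ sudo gluster volume status volume1 detail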

  4. Inspect heal activity for split-brain indicators.
    $ sudo gluster volume heal volume1 info summary
    Brick node1:/srv/gluster/brick1
    Status: Connected
    Number of entries: 0
    
    Brick node2:/srv/gluster/brick1
    Status: Connected
    Number of entries: 0
    
    $ sudo gluster volume heal volume1 info split-brain
    Brick node1:/srv/gluster/brick1
    Number of entries: 0
    
    Brick node2:/srv/gluster/brick1
    Number of entries: 0

    Heal checks are most relevant for replica and disperse volumes; a sustained non-zero count usually means the cluster is still converging or is repeatedly failing to heal.
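
    For scripted alerting, statistics heal-count emits only the per-brick entry counts, which is cheaper to parse than the full info output:

    $ sudo gluster volume heal volume1 statistics heal-count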

    Any split-brain entry indicates diverged file versions across bricks, and leaving it unresolved risks serving inconsistent data to clients.
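
    If split-brain entries do appear, the CLI can resolve them per file by picking a source policy. As a sketch (the file path is a hypothetical example, given relative to the volume root), resolving in favor of the newest copy looks like:

    $ sudo gluster volume heal volume1 split-brain latest-mtime /data/app.log

    Other policies include bigger-file and source-brick; choose deliberately, because the losing copy is discarded.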

  5. Track rebalance activity after brick changes.
    $ sudo gluster volume rebalance volume1 status
         Node  Rebalanced-files       size    scanned   failures     status
    ---------  ----------------  ---------  ---------  ---------  ---------
        node1                 0         0B          0          0  completed
        node2                 0         0B          0          0  completed

    Rebalance is common after adding or removing bricks on distributed layouts, and long runtimes usually correlate with the amount of data to migrate.
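
    A minimal scripted check, assuming the six-column layout shown above, flags any node with a non-zero failures column:

    $ sudo gluster volume rebalance volume1 status | awk 'NR>2 && $5+0 > 0 {print "rebalance failures on", $1}'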

  6. Review GlusterFS logs for errors and warnings.
    $ sudo tail -n 20 /var/log/glusterfs/glusterd.log
    [2025-05-13 10:31:08.912345 +0000] I [MSGID: 106487] [glusterd.c:1960:glusterd_init] 0-management: Glusterd started successfully
    [2025-05-13 10:33:41.104882 +0000] W [MSGID: 100030] [rpc-clnt.c:735:rpc_clnt_handle_disconnect] 0-rpc: disconnecting from peer node2
    ##### snipped #####

    Cluster-wide logs are commonly under /var/log/glusterfs, with brick-specific logs typically under /var/log/glusterfs/bricks.
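
    The single letter after the timestamp is the severity, as in the W line above, so a simple grep surfaces only error-level lines:

    $ sudo grep -E '\] E \[' /var/log/glusterfs/glusterd.log | tail -n 20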

  7. Check geo-replication status when replication to a secondary volume is enabled.

    Geo-replication is asynchronous, so a stopped or faulty session can silently leave the secondary behind even when the primary volume looks healthy.
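
    A minimal sweep across configured sessions (assuming at least one geo-replication session exists) is:

    $ sudo gluster volume geo-replication status

    Workers reporting Faulty, or a last-synced time that keeps falling behind, are the signals to alert on; Active and Passive are the normal states of a healthy session.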