Monitoring GlusterFS health helps keep distributed volumes available by catching early warning signs such as disconnected peers, offline bricks, or a growing heal backlog before client workloads start timing out or returning I/O errors.

A GlusterFS cluster forms a trusted pool of peers and serves data from brick directories grouped into volumes. Health signals are exposed through the gluster CLI by checking peer connectivity, brick process state, and background activity such as self-heal and rebalance.

Checks differ by volume type and features in use: replica and disperse volumes depend heavily on heal and split-brain state, while distributed layouts focus on brick availability and capacity. Treat persistent non-zero heal entries, any split-brain listings, repeated errors in /var/log/glusterfs, and unhealthy geo-replication sessions as incidents rather than “noise”.

GlusterFS monitoring checklist:

  1. Check peer connectivity across the trusted pool.
    $ sudo gluster peer status
    Number of Peers: 2
    
    Hostname: node2
    Uuid: 6770f88c-9ec5-4cf8-b9f5-658fa17b6bdc
    State: Peer in Cluster (Connected)
    
    Hostname: node3
    Uuid: 5a3c65f3-1b4d-4d6e-93d4-4c24f0b6b5bf
    State: Peer in Cluster (Connected)

    Peer in Cluster (Connected) indicates the peer is reachable and participating in the cluster.
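
    As a compact cross-check, gluster pool list prints one line per pool member, including the local node (sample output; the localhost UUID here is illustrative):

    $ sudo gluster pool list
    UUID                                    Hostname        State
    6770f88c-9ec5-4cf8-b9f5-658fa17b6bdc    node2           Connected
    5a3c65f3-1b4d-4d6e-93d4-4c24f0b6b5bf    node3           Connected
    9b5f4c21-7a3e-4f0d-8c2a-0d1e2f3a4b5c    localhost       Connected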

  2. Review volume status for brick health.
    $ sudo gluster volume status volume1
    Status of volume: volume1
    Gluster process                             TCP Port  RDMA Port  Online  Pid
    ------------------------------------------------------------------------------
    Brick node1:/srv/gluster/brick1             49152     0          Y       2143
    Brick node2:/srv/gluster/brick1             49152     0          Y       2311
    Self-heal Daemon on node1                   N/A       N/A        Y       2202
    Self-heal Daemon on node2                   N/A       N/A        Y       2370

    Replace volume1 with the target volume name, and treat any Online value of N as a service-impacting fault.
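
    Omitting the volume name reports every volume in the pool at once, which suits a scheduled sweep:

    $ sudo gluster volume status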

  3. Check brick filesystems for free space and inode usage.
    $ df -h /srv/gluster/brick1
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdb1       1.8T  1.1T  640G  64% /srv/gluster/brick1
    
    $ df -i /srv/gluster/brick1
    Filesystem       Inodes   IUsed    IFree IUse% Mounted on
    /dev/sdb1     122093568 512340 121581228    1% /srv/gluster/brick1

    A brick that reaches 100% space or inode usage can trigger client write failures and may block heal or rebalance progress.
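
    The same capacity signals are available from the gluster CLI itself: volume status detail reports free disk space and inode counts per brick, avoiding shell access to each node.

    $ sudo gluster volume status volume1 detail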

  4. Inspect heal activity for split-brain indicators.
    $ sudo gluster volume heal volume1 info summary
    Brick node1:/srv/gluster/brick1
    Status: Connected
    Number of entries: 0
    
    Brick node2:/srv/gluster/brick1
    Status: Connected
    Number of entries: 0
    
    $ sudo gluster volume heal volume1 info split-brain
    Brick node1:/srv/gluster/brick1
    Number of entries: 0
    
    Brick node2:/srv/gluster/brick1
    Number of entries: 0

    Heal checks are most relevant for replica and disperse volumes; a sustained non-zero count usually means the cluster is still converging or is repeatedly failing to heal.
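
    For scripted alerting, statistics heal-count emits only the per-brick entry counts, which is cheaper to parse than the full info output:

    $ sudo gluster volume heal volume1 statistics heal-count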

    Any split-brain entry indicates diverged file versions across bricks, and leaving it unresolved risks serving inconsistent data to clients.
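
    If split-brain entries do appear, the CLI can resolve them per file by picking a source policy. As a sketch (the file path is a hypothetical example, given relative to the volume root), resolving in favor of the newest copy looks like:

    $ sudo gluster volume heal volume1 split-brain latest-mtime /data/app.log

    Other policies include bigger-file and source-brick; choose deliberately, because the losing copy is discarded.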

  5. Track rebalance activity after brick changes.
    $ sudo gluster volume rebalance volume1 status
         Node  Rebalanced-files       size    scanned   failures     status
    ---------  ----------------  ---------  ---------  ---------  ---------
        node1                 0         0B          0          0  completed
        node2                 0         0B          0          0  completed

    Rebalance is common after adding or removing bricks on distributed layouts, and long runtimes usually correlate with the amount of data to migrate.
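
    A minimal scripted check, assuming the six-column layout shown above, flags any node with a non-zero failures column:

    $ sudo gluster volume rebalance volume1 status | awk 'NR>2 && $5+0 > 0 {print "rebalance failures on", $1}'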

  6. Review GlusterFS logs for errors and warnings.
    $ sudo tail -n 20 /var/log/glusterfs/glusterd.log
    [2025-05-13 10:31:08.912345 +0000] I [MSGID: 106487] [glusterd.c:1960:glusterd_init] 0-management: Glusterd started successfully
    [2025-05-13 10:33:41.104882 +0000] W [MSGID: 100030] [rpc-clnt.c:735:rpc_clnt_handle_disconnect] 0-rpc: disconnecting from peer node2
    ##### snipped #####

    Cluster-wide logs are commonly under /var/log/glusterfs, with brick-specific logs typically under /var/log/glusterfs/bricks.
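
    The single letter after the timestamp is the severity, as in the W line above, so a simple grep surfaces only error-level lines:

    $ sudo grep -E '\] E \[' /var/log/glusterfs/glusterd.log | tail -n 20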

  7. Check geo-replication status when replication to a secondary volume is enabled.

    Geo-replication is asynchronous, so a stopped or faulty session can silently leave the secondary behind even when the primary volume looks healthy.
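
    A minimal sweep across configured sessions (assuming at least one geo-replication session exists) is:

    $ sudo gluster volume geo-replication status

    Workers reporting Faulty, or a last-synced time that keeps falling behind, are the signals to alert on; Active and Passive are the normal states of a healthy session.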