How to monitor shard recovery in Elasticsearch

Monitoring shard recovery in Elasticsearch shows how quickly data is copied and replayed after node loss, restarts, snapshot restores, or shard rebalancing, so slow recoveries and extended periods of reduced redundancy are noticed early.

Shard recovery progresses through stages that copy segment files and then replay recent operations, typically moving through init, index, verify_index, translog, and finalize before reaching done. The compact /_cat/recovery view gives a fast, sortable snapshot of what is currently moving, while /<index>/_recovery provides per-shard detail that helps pinpoint whether file transfer, checksum verification, or translog replay is the bottleneck.

Requests shown use http://localhost:9200 as an example endpoint and may require HTTPS plus authentication in secured clusters.

Slow recoveries are commonly caused by recovery throttling (for example, a low indices.recovery.max_bytes_per_sec), disk saturation, or noisy neighbors on the same nodes. Forcing extra relocations while a recovery is running can amplify the load.
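To judge whether throttling is a factor, the effective recovery settings can be read from the cluster settings API. The following is a minimal sketch, assuming the example endpoint used in this guide; the function name recovery_settings is illustrative and not part of Elasticsearch.

```shell
# Keep only the indices.recovery.* entries from flat, pretty-printed
# cluster settings JSON read on stdin, trimming indentation and commas.
recovery_settings() {
  grep '"indices\.recovery\.' | sed 's/^[[:space:]]*//; s/,$//'
}

# Example against a live cluster (endpoint is the example one from this guide):
# curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | recovery_settings
```

Reading from stdin keeps the filter independent of how the request is authenticated.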

Steps to monitor shard recovery in Elasticsearch:

Check cluster health for relocating, initializing, and unassigned shards.

$ curl -s "http://localhost:9200/_cluster/health?pretty"
{
  "cluster_name" : "es-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 128,
  "active_shards" : 253,
  "relocating_shards" : 2,
  "initializing_shards" : 1,
  "unassigned_shards" : 3,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.8
}

In secured clusters, switch the URL to https:// and pass credentials (curl -u) or an API key header, plus --cacert when the certificate is signed by a private CA.
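For secured clusters, a small wrapper can add the extra curl options in one place. This is only a sketch: the environment variable names ES_URL, ES_API_KEY, and ES_CACERT and the function name es_curl are assumptions made for illustration, not settings that curl or Elasticsearch read on their own.

```shell
# Run curl against an Elasticsearch API path, adding TLS and API key
# options only when the (hypothetical) environment variables are set.
es_curl() {
  es_path=$1
  set -- -s
  # Trust a private CA when one is configured.
  if [ -n "${ES_CACERT:-}" ]; then
    set -- "$@" --cacert "$ES_CACERT"
  fi
  # ES_API_KEY is expected to hold the base64-encoded API key value.
  if [ -n "${ES_API_KEY:-}" ]; then
    set -- "$@" -H "Authorization: ApiKey $ES_API_KEY"
  fi
  # Fall back to the example endpoint when ES_URL is not set.
  curl "$@" "${ES_URL:-http://localhost:9200}$es_path"
}

# Example: es_curl "/_cluster/health?pretty"
```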

List active shard recoveries with stage and progress columns.

$ curl -s "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,type,stage,source_node,target_node,files_percent,bytes_percent,translog_ops_percent&s=index,shard"
index        shard time  type  stage    source_node target_node files_percent bytes_percent translog_ops_percent
logs-2025.01 0     00:18 peer  index    es-hot-1    es-hot-2    73.2%         68.4%        0.0%
logs-2025.01 1     00:07 peer  translog es-hot-1    es-hot-3    100.0%        100.0%       42.7%

Output containing only the header row means there are no active recoveries.
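When many shards are moving at once, a per-stage count can be easier to read than the full table. The sketch below assumes the header row produced by the v parameter is present and that the stage column is requested; recovery_by_stage is a hypothetical helper name.

```shell
# Count active recoveries per stage from _cat/recovery text on stdin.
# The stage column is located by its header name, so column order may vary.
recovery_by_stage() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "stage") col = i; next }
    col     { count[$col]++ }
    END     { for (s in count) printf "%s %d\n", s, count[s] }
  ' | sort
}

# Example:
# curl -s "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,stage" | recovery_by_stage
```

Many rows in the index stage point at file copy as the dominant work, while many in the translog stage point at operation replay.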

Refresh the recovery view periodically to track live progress.

$ watch -n 2 'curl -s "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,stage,source_node,target_node,bytes_percent&s=index,shard"'
Every 2.0s: curl -s "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,stage,source_node,target_node,bytes_percent&s=index,shard"  Fri Jan  2 10:15:01 2026

index        shard time  stage    source_node target_node bytes_percent
logs-2025.01 0     00:20 translog es-hot-1    es-hot-2    100.0%

Press Ctrl+C to exit watch.
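Where watch is unavailable, or a script needs to block until recoveries finish, a polling loop is one alternative. This is a sketch under stated assumptions: the function names and the optional base URL and interval arguments are illustrative, and the loop treats a header-only response as "no active recoveries".

```shell
# Succeed (exit 0) when the _cat/recovery text on stdin has at least one
# data row after the header line printed by the "v" parameter.
has_active_recoveries() {
  awk 'NR > 1 && NF > 0 { found = 1 } END { exit (found ? 0 : 1) }'
}

# Poll until no active recoveries remain, printing each snapshot.
# Arguments (both optional): base URL, polling interval in seconds.
wait_for_recoveries() {
  base=${1:-http://localhost:9200}
  interval=${2:-2}
  url="$base/_cat/recovery?v&active_only=true&h=index,shard,time,stage,bytes_percent"
  while :; do
    # -f makes curl fail on HTTP errors instead of printing the error body.
    out=$(curl -sf "$url") || { echo "request failed" >&2; return 2; }
    printf '%s\n' "$out" | has_active_recoveries || return 0
    printf '%s\n' "$out"
    sleep "$interval"
  done
}

# Example: wait_for_recoveries http://localhost:9200 5
```

Returning a distinct status on request failure avoids mistaking an unreachable cluster for a finished recovery.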

Inspect detailed recovery metrics for a specific index.

$ curl -s "http://localhost:9200/logs-2025.01/_recovery?active_only=true&detailed=true&pretty"
{
  "logs-2025.01" : {
    "shards" : [
      {
        "id" : 0,
        "type" : "PEER",
        "stage" : "TRANSLOG",
        "primary" : false,
        "source" : {
          "name" : "es-hot-1"
        },
        "target" : {
          "name" : "es-hot-2"
        },
        "index" : {
          "files" : {
            "total" : 152,
            "recovered" : 152,
            "percent" : "100.0%"
          },
          "size" : {
            "total_in_bytes" : 104857600,
            "recovered_in_bytes" : 104857600,
            "percent" : "100.0%"
          }
        },
        "translog" : {
          "recovered" : 427,
          "total" : 1000,
          "percent" : "42.7%"
        }
      }
##### snipped #####
    ]
  }
}
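To turn the stage field into a quick bottleneck hint, the detailed response can be passed through a small filter. The sketch below assumes pretty-printed JSON with one key per line, as returned with the pretty parameter; explain_stages is a hypothetical helper and the descriptions are a simplification of what each stage does.

```shell
# For each shard in pretty-printed /<index>/_recovery JSON on stdin,
# print the shard number, its stage, and the work that stage represents.
explain_stages() {
  awk -F'"' '
    # Numeric shard ids split into 3 fields; quoted node ids give 5.
    $2 == "id" && NF == 3 { shard = $3; gsub(/[^0-9]/, "", shard) }
    $2 == "stage" {
      stage = $4
      if      (stage == "INDEX")        what = "copying segment files"
      else if (stage == "VERIFY_INDEX") what = "verifying checksums"
      else if (stage == "TRANSLOG")     what = "replaying translog operations"
      else if (stage == "FINALIZE")     what = "finalizing"
      else if (stage == "DONE")         what = "finished"
      else                              what = "starting or unknown"
      printf "shard %s: %s (%s)\n", shard, stage, what
    }
  '
}

# Example:
# curl -s "http://localhost:9200/logs-2025.01/_recovery?active_only=true&detailed=true&pretty" | explain_stages
```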

Confirm shards for the affected index return to STARTED state.

$ curl -s "http://localhost:9200/_cat/shards/logs-2025.01?v&h=index,shard,prirep,state,docs,store,node&s=shard,prirep"
index        shard prirep state   docs store   node
logs-2025.01 0     p      STARTED 9412 102.3mb es-hot-1
logs-2025.01 0     r      STARTED 9412 102.3mb es-hot-2
logs-2025.01 1     p      STARTED 9388 101.9mb es-hot-1
logs-2025.01 1     r      STARTED 9388 101.9mb es-hot-3

States like INITIALIZING and RELOCATING indicate recovery is still in progress, while UNASSIGNED means the shard copy has not been allocated to a node yet.
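For indices with many shards, filtering out the STARTED rows makes the remaining work easier to spot. This sketch assumes the header row from the v parameter and a state column in the output; pending_shards is an illustrative name.

```shell
# Print shards whose state is not STARTED from _cat/shards text on stdin.
# Exit status is 1 when any such shard is found, 0 otherwise.
pending_shards() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "state") col = i; next }
    col && $col != "STARTED" { print; pending = 1 }
    END { exit (pending ? 1 : 0) }
  '
}

# Example:
# curl -s "http://localhost:9200/_cat/shards/logs-2025.01?v&h=index,shard,prirep,state,node" | pending_shards
```

The non-zero exit status lets the check be used directly in a script condition.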

Wait until the cluster reports zero relocating and initializing shards.

$ curl -s "http://localhost:9200/_cluster/health?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=60s&pretty"
{
  "cluster_name" : "es-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 128,
  "active_shards" : 256,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.2
}

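Cluster status can remain yellow because of unassigned replicas even after recovery and relocation have finished. If the timeout expires before the wait conditions are met, the response reports "timed_out" : true.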
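In automation, the health response can be reduced to a single pass or fail result. The sketch below assumes pretty-printed JSON with one key per line; recovery_settled is a hypothetical helper name, and it deliberately ignores the status color because yellow can persist after recovery has finished.

```shell
# Succeed when pretty-printed cluster health JSON on stdin shows the wait
# did not time out and no shards are relocating or initializing.
recovery_settled() {
  awk -F'[ ,:"]+' '
    BEGIN { relo = -1; init = -1 }
    $2 == "timed_out"           { timed_out = $3 }
    $2 == "relocating_shards"   { relo = $3 }
    $2 == "initializing_shards" { init = $3 }
    END { exit ((timed_out == "false" && relo == 0 && init == 0) ? 0 : 1) }
  '
}

# Example:
# curl -s "http://localhost:9200/_cluster/health?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=60s&pretty" \
#   | recovery_settled && echo "recovery settled"
```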

Author: Mohd Shakir Zakaria