Monitoring shard recovery in Elasticsearch shows how quickly data is copied and replayed after node loss, restarts, snapshot restores, or shard rebalancing, which helps catch performance surprises early and keeps risk windows short.
Shard recovery progresses through phases that copy segment files and catch up recent operations, typically moving from index to translog before finalizing. The compact /_cat/recovery view is designed for fast, sortable snapshots of what is currently moving, while /<index>/_recovery provides per-shard details that pinpoint whether file transfer, checksum verification, or translog replay is the bottleneck.
Requests shown use http://localhost:9200 as an example endpoint and may require HTTPS plus authentication in secured clusters. Slow recoveries are commonly caused by recovery throttling (for example indices.recovery.max_bytes_per_sec), disk saturation, or noisy neighbors on the same nodes, and forcing extra relocations during recovery can amplify the load.
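To check whether throttling is a factor, the recovery settings can be read from the cluster settings API with include_defaults=true and flat_settings=true. The helper below is a minimal sketch, not part of Elasticsearch: recovery_settings is a hypothetical function name, and it assumes a pretty-printed flat-settings response on standard input.

```shell
# Hypothetical helper: print every indices.recovery.* setting found in a
# flat-settings JSON response read from stdin, as "key = value" lines.
# Assumes string values, which is how flat settings are normally rendered.
recovery_settings() {
  grep -o '"indices\.recovery\.[a-z_.]*" *: *"[^"]*"' |
    sed 's/"//g; s/ *: */ = /'
}

# Intended use against the example endpoint (not executed here):
#   curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
#     | recovery_settings
```

A setting can appear more than once if it is set in the persistent or transient section as well as in the defaults, so every occurrence is printed.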
Steps to monitor shard recovery in Elasticsearch:
- Check cluster health for relocating, initializing, and unassigned shards.
$ curl -s "http://localhost:9200/_cluster/health?pretty"
{
  "cluster_name" : "es-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 128,
  "active_shards" : 253,
  "relocating_shards" : 2,
  "initializing_shards" : 1,
  "unassigned_shards" : 3,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.8
}
Secured clusters may require HTTPS, credentials or an API key, and a CA certificate for curl.
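For scripting, the same counters are also available as plain columns from /_cat/health. The sketch below assumes the request uses h=status,relo,init,unassign, so each line carries exactly those four fields in that order; summarize_health is a hypothetical helper name.

```shell
# Hypothetical helper: read "status relo init unassign" lines from stdin and
# print a one-line summary of shards that are moving or still unassigned.
summarize_health() {
  awk '{
    moving = $2 + $3   # relocating + initializing
    printf "status=%s moving=%d unassigned=%d\n", $1, moving, $4
  }'
}

# Intended use against the example endpoint (not executed here):
#   curl -s "http://localhost:9200/_cat/health?h=status,relo,init,unassign" \
#     | summarize_health
```

With the counts from the health response above, the summary would read status=yellow moving=3 unassigned=3.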
- List active shard recoveries with stage and progress columns.
$ curl -s "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,type,stage,source_node,target_node,files_percent,bytes_percent,translog_ops_percent&s=index,shard"
index        shard time  type stage    source_node target_node files_percent bytes_percent translog_ops_percent
logs-2025.01 0     00:18 peer index    es-hot-1    es-hot-2    73.2%         68.4%         0.0%
logs-2025.01 1     00:07 peer translog es-hot-1    es-hot-3    100.0%        100.0%        42.7%
Empty output indicates there are no active recoveries.
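When many shards are recovering at once, a per-stage count is often easier to read than the full listing. This is a minimal sketch that assumes the request drops the v flag and selects h=index,shard,stage, so the stage is the third column; recoveries_by_stage is a hypothetical helper name.

```shell
# Hypothetical helper: count active recoveries per stage.
# Expects headerless "index shard stage" lines on stdin.
recoveries_by_stage() {
  awk '{ count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}

# Intended use against the example endpoint (not executed here):
#   curl -s "http://localhost:9200/_cat/recovery?active_only=true&h=index,shard,stage" \
#     | recoveries_by_stage
```

A count that stays concentrated in one stage across several checks suggests where the bottleneck sits.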
- Refresh the recovery view periodically to track live progress.
$ watch -n 2 'curl -s "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,stage,source_node,target_node,bytes_percent&s=index,shard"'
Every 2.0s: curl -s "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,stage,source_node,target_node,bytes_percent&s=index,shard"    Fri Jan  2 10:15:01 2026
index        shard time  stage    source_node target_node bytes_percent
logs-2025.01 0     00:20 translog es-hot-1    es-hot-2    55.9%
Press Ctrl+C to exit watch.
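Where watch is unavailable, or a script needs to stop on its own once recovery finishes, a small polling loop can serve the same purpose. The sketch below is one possible shape, not an official tool: fetch_recoveries, wait_for_recoveries, and the ES_URL variable are hypothetical names, and the loop relies on the empty-output behaviour described above.

```shell
# Hypothetical wrapper around the request; ES_URL is an assumed variable that
# defaults to the article's example endpoint. -f makes HTTP errors fail.
fetch_recoveries() {
  curl -s -f "${ES_URL:-http://localhost:9200}/_cat/recovery?active_only=true&h=index,shard,stage,bytes_percent"
}

# Poll until no active recoveries remain.
#   $1 = seconds between polls (default 2)
#   $2 = maximum number of polls (default 30)
# Returns 0 when idle, 1 when the poll limit is reached, 2 if a request fails.
wait_for_recoveries() {
  interval=${1:-2}
  max_polls=${2:-30}
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    active=$(fetch_recoveries) || return 2
    if [ -z "$active" ]; then
      echo "no active recoveries"
      return 0
    fi
    printf '%s\n' "$active"
    polls=$((polls + 1))
    sleep "$interval"
  done
  echo "still recovering after $max_polls polls" >&2
  return 1
}
```

Calling wait_for_recoveries 5 60 would poll every five seconds for up to five minutes. The separate failure code keeps an unreachable cluster from being mistaken for an idle one.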
- Inspect detailed recovery metrics for a specific index.
$ curl -s "http://localhost:9200/logs-2025.01/_recovery?active_only=true&detailed=true&pretty"
{
  "logs-2025.01" : {
    "shards" : [
      {
        "id" : 0,
        "type" : "PEER",
        "stage" : "TRANSLOG",
        "primary" : false,
        "source" : { "name" : "es-hot-1" },
        "target" : { "name" : "es-hot-2" },
        "index" : {
          "files" : { "total" : 152, "recovered" : 152, "percent" : "100.0%" },
          "size" : { "total_in_bytes" : 104857600, "recovered_in_bytes" : 58720256, "percent" : "56.0%" }
        },
        "translog" : { "recovered" : 427, "total" : 1000, "percent" : "42.7%" }
      }
      ##### snipped #####
    ]
  }
}
- Confirm shards for the affected index return to STARTED state.
$ curl -s "http://localhost:9200/_cat/shards/logs-2025.01?v&h=index,shard,prirep,state,docs,store,node&s=shard,prirep"
index        shard prirep state   docs store   node
logs-2025.01 0     p      STARTED 9412 102.3mb es-hot-1
logs-2025.01 0     r      STARTED 9412 102.3mb es-hot-2
logs-2025.01 1     p      STARTED 9388 101.9mb es-hot-1
logs-2025.01 1     r      STARTED 9388 101.9mb es-hot-3
States like INITIALIZING and RELOCATING indicate recovery is still in progress.
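On indices with many shards, it can help to show only the copies that have not yet reached STARTED. A minimal sketch, assuming the request omits v and selects h=index,shard,prirep,state so that the state is the fourth column; unstarted_shards is a hypothetical helper name.

```shell
# Hypothetical helper: print only shard copies whose state is not STARTED.
# Expects headerless "index shard prirep state" lines on stdin.
unstarted_shards() {
  awk '$4 != "STARTED" { print $1, $2, $3, $4 }'
}

# Intended use against the example endpoint (not executed here):
#   curl -s "http://localhost:9200/_cat/shards/logs-2025.01?h=index,shard,prirep,state" \
#     | unstarted_shards
```

Empty output from the filter means every listed copy is STARTED.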
- Wait until the cluster reports zero relocating and initializing shards.
$ curl -s "http://localhost:9200/_cluster/health?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=60s&pretty"
{
  "cluster_name" : "es-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 128,
  "active_shards" : 256,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.2
}
Cluster status can remain yellow with missing replicas even when recovery and relocation have finished.
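In a script, the timed_out field is what distinguishes a wait that completed from one that hit the timeout. The check below is a minimal sketch that matches the field textually rather than parsing JSON; health_wait_ok is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the health response on stdin reports
# "timed_out" : false, meaning the wait conditions were met before the timeout.
health_wait_ok() {
  grep -Eq '"timed_out"[[:space:]]*:[[:space:]]*false'
}

# Intended use against the example endpoint (not executed here):
#   curl -s "http://localhost:9200/_cluster/health?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=60s" \
#     | health_wait_ok && echo "recovery settled" || echo "still recovering or request failed"
```

An empty or failed response also makes the check fail, so the script treats an unreachable cluster as not yet settled.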
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
