Monitoring shard recovery in Elasticsearch shows whether replicas, relocated shards, or restored data are actually catching up after node restarts, node loss, disk rebalancing, or snapshot restores. Fast visibility here reduces the risk of assuming a cluster is healthy while data movement is still consuming bandwidth and leaving shards partially protected.
The quickest operator view is /_cat/recovery, which lists shard recoveries (ongoing and completed, unless active_only=true is set) with their recovery type, current stage, source and target nodes, and progress percentages. For a deeper read, /<index>/_recovery exposes per-shard file, byte, translog, and timing details, while cluster health counters show whether relocation and initialization are still draining across the cluster.
Secured deployments commonly use an authenticated HTTPS endpoint for these curl requests. Elastic's current API docs still position the CAT endpoints as operator-facing views rather than application APIs, and slow or stalled recoveries commonly trace back to throttling, disk pressure, or allocation rules rather than the recovery commands themselves.
Steps to monitor shard recovery in Elasticsearch:
- Request a concise cluster-wide summary of relocating, initializing, and unassigned shards.
$ curl -sS --fail "http://localhost:9200/_cluster/health?filter_path=cluster_name,status,number_of_nodes,number_of_data_nodes,relocating_shards,initializing_shards,unassigned_shards,number_of_pending_tasks,active_shards_percent_as_number&pretty"
{
  "cluster_name" : "es-cluster",
  "status" : "yellow",
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "relocating_shards" : 2,
  "initializing_shards" : 1,
  "unassigned_shards" : 3,
  "number_of_pending_tasks" : 0,
  "active_shards_percent_as_number" : 98.8
}
Rising relocating_shards or initializing_shards counts confirm recovery work is still active, while persistent unassigned_shards often point to an allocation problem rather than slow copying alone.
For secured clusters, switch the URL to https:// and add authentication such as --user elastic:password or -H "Authorization: ApiKey BASE64VALUE". When the HTTPS endpoint uses a private CA, also pass the CA certificate with --cacert.
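The authentication options above can be gathered into one small wrapper so that later commands stay short. The sketch below is only an illustration: the function name es_curl and the environment variables ES_URL, ES_API_KEY, ES_CACERT, and ES_CURL_BIN are assumptions made for this example and are not part of Elasticsearch or curl. Only the curl flags themselves (-sS, --fail, -H, --cacert) are real.

```shell
#!/bin/sh
# Sketch only. ES_URL, ES_API_KEY, ES_CACERT and ES_CURL_BIN are hypothetical
# environment variables chosen for this example, not Elasticsearch settings.

# es_curl PATH: call the cluster with an optional API key and CA certificate.
es_curl() {
  path="$1"
  set -- -sS --fail
  if [ -n "${ES_API_KEY:-}" ]; then
    set -- "$@" -H "Authorization: ApiKey ${ES_API_KEY}"
  fi
  if [ -n "${ES_CACERT:-}" ]; then
    set -- "$@" --cacert "${ES_CACERT}"
  fi
  # Setting ES_CURL_BIN=echo prints the assembled arguments instead of sending them.
  "${ES_CURL_BIN:-curl}" "$@" "${ES_URL:-http://localhost:9200}${path}"
}

# Example (not run here):
#   ES_URL=https://es.example.internal:9200 ES_API_KEY=BASE64VALUE \
#     ES_CACERT=/path/to/ca.crt es_curl "/_cat/recovery?v&active_only=true"
```

With no variables set, the wrapper falls back to the unauthenticated http://localhost:9200 endpoint used in the other steps.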
- Use the CAT health view for a one-line operator snapshot during long recovery windows.
$ curl -sS --fail "http://localhost:9200/_cat/health?v=true&h=cluster,status,node.total,node.data,shards,pri,relo,init,unassign,pending_tasks,active_shards_percent"
cluster    status node.total node.data shards pri relo init unassign pending_tasks active_shards_percent
es-cluster yellow 3          3         256    128 2    1    3        0             98.8%
Elastic's current CAT health documentation still calls this view useful for tracking recovery over time, but it remains intended for human triage rather than monitoring integrations.
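For interactive triage, the relo and init columns of that one-line view can be turned into a simple yes/no check. The following is a minimal sketch, assuming the output was requested with v=true so that a header row is present. The helper name recovery_active is hypothetical, and because the CAT views are meant for humans, this is better suited to ad hoc shell sessions than to a monitoring integration.

```shell
#!/bin/sh
# Sketch only: recovery_active is a hypothetical helper name.
# Reads `_cat/health?v=true&h=...` output on stdin.
# Exit 0: relo + init > 0 (recovery still active)
# Exit 1: both counters are zero
# Exit 2: the relo or init column is missing from the header
recovery_active() {
  awk '
    NR == 1 {
      # Map column names to positions so the h= order does not matter.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("relo" in col) || !("init" in col)) { bad = 1; exit }
      next
    }
    { busy = ($(col["relo"]) + $(col["init"]) > 0) }
    END {
      if (bad) exit 2
      exit (busy ? 0 : 1)
    }
  '
}

# Example (not run here):
#   curl -sS --fail "http://localhost:9200/_cat/health?v=true&h=cluster,status,relo,init,unassign" \
#     | recovery_active && echo "recovery still running"
```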
- List active shard recoveries with stage and progress columns.
$ curl -sS --fail "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,type,stage,source_node,target_node,files_percent,bytes_percent,translog_ops_percent&s=index,shard"
index        shard time  type stage    source_node target_node files_percent bytes_percent translog_ops_percent
logs-2026.04 0     01:12 peer index    es-hot-1    es-hot-2    74.6%         68.3%         0.0%
logs-2026.04 1     00:19 peer translog es-hot-1    es-hot-3    100.0%        100.0%        42.7%
The current CAT recovery API still supports active_only=true, so output with no data rows means there are no shard recoveries in progress at that moment.
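When many shards are recovering at once, a per-stage tally of the CAT recovery listing is often easier to read than the full table. The sketch below counts rows by the stage column. It assumes the request includes v so that the header row is present, and the helper name recoveries_by_stage is hypothetical.

```shell
#!/bin/sh
# Sketch only: recoveries_by_stage is a hypothetical helper name.
# Reads `_cat/recovery?v&...` output on stdin and prints "<stage> <count>" lines.
recoveries_by_stage() {
  awk '
    # Locate the stage column from the header row.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "stage") s = i; next }
    s { count[$s]++ }
    END { for (name in count) print name, count[name] }
  ' | sort
}

# Example (not run here):
#   curl -sS --fail "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,stage" \
#     | recoveries_by_stage
```

A large count in the index stage points at file copy still in flight, while a backlog in translog means the copied shards are replaying operations.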
- Refresh the CAT recovery view periodically to watch stage changes and percentage movement in real time.
$ watch -n 2 'curl -sS --fail "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,stage,source_node,target_node,bytes_percent,translog_ops_percent&s=index,shard"'
Every 2.0s: curl -sS --fail "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,stage,source_node,target_node,bytes_percent,translog_ops_percent&s=index,shard"  Thu Apr 2 16:30:01 2026
index        shard time  stage    source_node target_node bytes_percent translog_ops_percent
logs-2026.04 0     01:14 index    es-hot-1    es-hot-2    71.1%         0.0%
logs-2026.04 1     00:21 translog es-hot-1    es-hot-3    100.0%        48.9%
Press Ctrl+C to stop watch. When percentages stay flat for multiple refreshes, compare the affected shards against throttling, disk, and allocation signals before forcing extra movement.
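Spotting flat percentages by eye gets harder as the list grows. One possible approach, sketched below, is to save two snapshots of the CAT recovery output some time apart and print the recoveries whose bytes_percent and translog_ops_percent are identical in both. The helper name stalled_shards and the snapshot file names are hypothetical, and the sketch assumes both snapshots use the same h= columns with a header row.

```shell
#!/bin/sh
# Sketch only: stalled_shards is a hypothetical helper name.
# stalled_shards PREV CUR: print "index shard target_node" for every recovery
# whose progress columns did not change between the two snapshot files.
stalled_shards() {
  awk '
    # Header row of each file: map column names to positions.
    FNR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
      key = $(col["index"]) " " $(col["shard"]) " " $(col["target_node"])
      progress = $(col["bytes_percent"]) " " $(col["translog_ops_percent"])
    }
    # First file: remember the progress seen for each recovery.
    NR == FNR { prev[key] = progress; next }
    # Second file: report recoveries that show the same progress as before.
    (key in prev) && prev[key] == progress { print key }
  ' "$1" "$2"
}

# Example (not run here); file names and the 30 second gap are arbitrary:
#   url='http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,target_node,bytes_percent,translog_ops_percent'
#   curl -sS --fail "$url" > /tmp/recovery.prev
#   sleep 30
#   curl -sS --fail "$url" > /tmp/recovery.cur
#   stalled_shards /tmp/recovery.prev /tmp/recovery.cur
```

A shard that appears here is only a candidate for a stall; a long gap between snapshots gives slow but healthy copies a chance to show movement.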
- Inspect detailed recovery metrics for the affected index when the fast view is not enough.
$ curl -sS --fail "http://localhost:9200/logs-2026.04/_recovery?active_only=true&detailed=true&human&pretty"
{
  "logs-2026.04" : {
    "shards" : [
      {
        "id" : 0,
        "type" : "PEER",
        "stage" : "TRANSLOG",
        "primary" : false,
        "source" : { "name" : "es-hot-1" },
        "target" : { "name" : "es-hot-2" },
        "index" : {
          "files" : { "total" : 152, "recovered" : 152, "percent" : "100.0%" },
          "size" : { "total" : "100mb", "total_in_bytes" : 104857600, "recovered" : "100mb", "recovered_in_bytes" : 104857600, "percent" : "100.0%" },
          "source_throttle_time" : "0s",
          "target_throttle_time" : "0s"
        },
        "translog" : { "recovered" : 489, "total" : 1000, "percent" : "48.9%", "total_time" : "4.2s" },
        "verify_index" : { "check_index_time" : "0s" }
      }
##### snipped #####
    ]
  }
}
Elastic's current index recovery API still supports active_only=true and detailed=true together. INDEX, VERIFY_INDEX, TRANSLOG, and DONE stages help separate file copy, validation, replay, and completion.
- Check the affected shards for their final state and any unassigned reason that still needs attention.
$ curl -sS --fail "http://localhost:9200/_cat/shards/logs-2026.04?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=shard,prirep"
index        shard prirep state   node     unassigned.reason
logs-2026.04 0     p      STARTED es-hot-1
logs-2026.04 0     r      STARTED es-hot-2
logs-2026.04 1     p      STARTED es-hot-1
logs-2026.04 1     r      STARTED es-hot-3
The current CAT shards API still supports the unassigned.reason column. Focus first on replicas or primaries that remain UNASSIGNED, INITIALIZING, or RELOCATING after the rest of the index has settled.
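On an index with many shards, it can help to hide everything that has already settled. The sketch below keeps only rows whose state is not STARTED. The helper name unsettled_shards is hypothetical, and the filter assumes the request uses v=true with state placed before node in h=, as in the command above, because an unassigned shard has an empty node column and every later field shifts left.

```shell
#!/bin/sh
# Sketch only: unsettled_shards is a hypothetical helper name.
# Reads `_cat/shards?v=true&h=index,shard,prirep,state,...` output on stdin
# and prints only the rows that are not in the STARTED state.
unsettled_shards() {
  awk '
    # Find the state column in the header row; the header itself is not printed.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "state") st = i; next }
    st && $st != "STARTED" { print }
  '
}

# Example (not run here):
#   curl -sS --fail "http://localhost:9200/_cat/shards/logs-2026.04?v=true&h=index,shard,prirep,state,node,unassigned.reason" \
#     | unsettled_shards
```

Empty output from the filter means every listed shard copy has reached STARTED.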
- Wait for the cluster to report no relocating or initializing shards before treating recovery as finished.
$ curl -sS --fail "http://localhost:9200/_cluster/health?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=60s&filter_path=cluster_name,status,relocating_shards,initializing_shards,unassigned_shards,active_shards_percent_as_number,timed_out&pretty"
{
  "cluster_name" : "es-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,
  "active_shards_percent_as_number" : 99.2
}
Current cluster health parameters still support waiting for zero relocating and initializing shards. A yellow result can still be acceptable when recovery is complete but one or more replicas remain intentionally unavailable or cannot be placed.
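Because a single wait request gives up after its timeout, a long recovery usually needs the request repeated until timed_out comes back false. The sketch below checks that field with a plain text match, which is fragile compared with a real JSON parser but avoids extra tooling. The helper name wait_finished is hypothetical, and the request must keep timed_out in its filter_path for the check to work.

```shell
#!/bin/sh
# Sketch only: wait_finished is a hypothetical helper name.
# Reads a cluster health response on stdin and succeeds when "timed_out" is
# false, meaning the wait_for_* conditions were met before the timeout.
wait_finished() {
  grep -Eq '"timed_out"[[:space:]]*:[[:space:]]*false'
}

# Example loop (not run here); the 5 second pause is arbitrary:
#   url='http://localhost:9200/_cluster/health?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=60s&filter_path=timed_out'
#   until curl -sS "$url" | wait_finished; do
#     echo "recovery still in progress"
#     sleep 5
#   done
```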
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.
