How to monitor shard recovery in Elasticsearch

Monitoring shard recovery in Elasticsearch shows whether replicas, relocated shards, or restored data are actually catching up after node restarts, node loss, disk rebalancing, or snapshot restores. Fast visibility here reduces the risk of assuming a cluster is healthy while data movement is still consuming bandwidth and leaving shards partially protected.

The quickest operator view is /_cat/recovery, which lists shard recoveries (both ongoing and completed, unless active_only=true is set) with their recovery type, current stage, source and target nodes, and progress percentages. For a deeper read, /<index>/_recovery exposes per-shard file, byte, translog, and timing details, while cluster health counters show whether relocation and initialization are still draining across the cluster.

Secured deployments commonly use an authenticated HTTPS endpoint for these curl requests. Elastic's current API docs still position the CAT endpoints as operator-facing views rather than application APIs, and slow or stalled recoveries commonly trace back to throttling, disk pressure, or allocation rules rather than the recovery commands themselves.

Steps to monitor shard recovery in Elasticsearch:

Request a concise cluster-wide summary of relocating, initializing, and unassigned shards.

$ curl -sS --fail "http://localhost:9200/_cluster/health?filter_path=cluster_name,status,number_of_nodes,number_of_data_nodes,relocating_shards,initializing_shards,unassigned_shards,number_of_pending_tasks,active_shards_percent_as_number&pretty"
{
  "cluster_name" : "es-cluster",
  "status" : "yellow",
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "relocating_shards" : 2,
  "initializing_shards" : 1,
  "unassigned_shards" : 3,
  "number_of_pending_tasks" : 0,
  "active_shards_percent_as_number" : 98.8
}

Nonzero relocating_shards or initializing_shards counts confirm recovery work is still active, while persistent unassigned_shards often point to an allocation problem rather than slow copying alone.

For secured clusters, switch the URL to https://, add authentication such as --user elastic:password or -H "Authorization: ApiKey BASE64VALUE", and pass --cacert with the CA certificate file when the HTTP endpoint uses a private CA.
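The connection options for a secured cluster can be collected in one wrapper so the later commands stay short. This is only a sketch: the function name es_curl and the ES_URL, ES_API_KEY, ES_CACERT, and ES_CURL_BIN variables are hypothetical conveniences, not part of Elasticsearch or curl. The flags it passes (-sS, --fail, --cacert, -H) are the standard curl options used in this guide.

```shell
# Hypothetical wrapper; the ES_* variable names are illustrative assumptions.
#   ES_URL       base URL, defaults to http://localhost:9200
#   ES_API_KEY   base64 API key sent as "Authorization: ApiKey ..."
#   ES_CACERT    CA certificate file for an endpoint signed by a private CA
#   ES_CURL_BIN  command to run instead of curl (set to echo for a dry run)
es_curl() {
  local path="$1"
  # Rebuild the positional parameters as the curl argument list.
  set -- -sS --fail
  if [ -n "${ES_CACERT:-}" ]; then
    set -- "$@" --cacert "$ES_CACERT"
  fi
  if [ -n "${ES_API_KEY:-}" ]; then
    set -- "$@" -H "Authorization: ApiKey $ES_API_KEY"
  fi
  "${ES_CURL_BIN:-curl}" "$@" "${ES_URL:-http://localhost:9200}${path}"
}
```

With the variables exported once per session, a call such as es_curl "/_cluster/health?pretty" replaces the full curl line. Setting ES_CURL_BIN=echo prints the arguments instead of sending a request, which helps when checking the options before touching a production cluster.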

Use the CAT health view for a one-line operator snapshot during long recovery windows.

$ curl -sS --fail "http://localhost:9200/_cat/health?v=true&h=cluster,status,node.total,node.data,shards,pri,relo,init,unassign,pending_tasks,active_shards_percent"
cluster    status node.total node.data shards pri relo init unassign pending_tasks active_shards_percent
es-cluster yellow          3         3    256 128    2    1        3             0                 98.8%

Elastic's current CAT health documentation still calls this view useful for tracking recovery over time, but it remains intended for human triage rather than monitoring integrations.
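For a quick terminal check, the relo, init, and unassign columns can be reduced to a single verdict. The helper below is a hypothetical sketch that assumes the output was requested with v=true, so the header row is present and can be used to locate columns by name. The name cat_health_verdict and the words "recovering" and "idle" are illustrative choices.

```shell
# Hypothetical helper: read `_cat/health?v=true` output on stdin and print
# one verdict line per cluster row.
cat_health_verdict() {
  awk '
    NR == 1 {
      # Map column names from the header row to field positions.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    {
      relo = $(col["relo"]); init = $(col["init"]); un = $(col["unassign"])
      # Any relocating or initializing shard means recovery work is active.
      state = (relo + init > 0) ? "recovering" : "idle"
      printf "%s %s relo=%d init=%d unassign=%d\n", $(col["cluster"]), state, relo, init, un
    }
  '
}
```

Piping the curl command above into cat_health_verdict would print one line per cluster row, in the form "name state relo=N init=N unassign=N".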

List active shard recoveries with stage and progress columns.

$ curl -sS --fail "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,type,stage,source_node,target_node,files_percent,bytes_percent,translog_ops_percent&s=index,shard"
index        shard time  type stage    source_node target_node files_percent bytes_percent translog_ops_percent
logs-2026.04 0     01:12 peer index    es-hot-1    es-hot-2    74.6%         68.3%        0.0%
logs-2026.04 1     00:19 peer translog es-hot-1    es-hot-3    100.0%        100.0%       42.7%

With active_only=true, output that contains only the header row means no shard recoveries are in progress at that moment.
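When many recoveries run at once, it can help to pick out the one furthest from completion. The following sketch assumes the output includes the header row (the v parameter) and the index, shard, stage, and bytes_percent columns. The function name slowest_recovery is hypothetical.

```shell
# Hypothetical helper: read `_cat/recovery?v` output on stdin and print the
# recovery with the lowest bytes_percent as "index shard stage bytes_percent".
slowest_recovery() {
  awk '
    NR == 1 {
      # Map column names from the header row to field positions.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    {
      pct = $(col["bytes_percent"]) + 0   # "68.3%" converts to 68.3
      if (!seen || pct < min) {
        seen = 1
        min = pct
        row = $(col["index"]) " " $(col["shard"]) " " $(col["stage"]) " " $(col["bytes_percent"])
      }
    }
    END { if (seen) print row }
  '
}
```

The function prints nothing when the input holds only a header row, which matches the no-active-recoveries case.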

Refresh the CAT recovery view periodically to watch stage changes and percentage movement in real time.

$ watch -n 2 'curl -sS --fail "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,stage,source_node,target_node,bytes_percent,translog_ops_percent&s=index,shard"'
Every 2.0s: curl -sS --fail "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,stage,source_node,target_node,bytes_percent,translog_ops_percent&s=index,shard"  Thu Apr  2 16:30:01 2026

index        shard time  stage    source_node target_node bytes_percent translog_ops_percent
logs-2026.04 0     01:14 index    es-hot-1    es-hot-2    71.1%        0.0%
logs-2026.04 1     00:21 translog es-hot-1    es-hot-3    100.0%       48.9%

Press Ctrl+C to stop watch. When percentages stay flat for multiple refreshes, compare the affected shards against throttling, disk, and allocation signals before forcing extra movement.
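Flat percentages are easier to confirm by comparing two saved snapshots than by eye. The sketch below is hypothetical: it assumes both files were captured with the same h= column list including the header row, and it treats a shard as stalled when neither bytes_percent nor translog_ops_percent changed between them. The name stalled_recoveries is illustrative.

```shell
# Hypothetical helper: compare two saved `_cat/recovery?v` snapshots
# (older file first) and print recoveries whose progress did not move.
stalled_recoveries() {
  awk '
    FNR == 1 {
      # Each file starts with a header row; map column names to positions.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    {
      key = $(col["index"]) " " $(col["shard"]) " " $(col["target_node"])
      progress = $(col["bytes_percent"]) " " $(col["translog_ops_percent"])
    }
    # Rows from the first file are remembered; rows from the second are compared.
    FILENAME == ARGV[1] { before[key] = progress; next }
    (key in before) && before[key] == progress { print key, progress }
  ' "$1" "$2"
}
```

A typical use would be to redirect the CAT recovery request into one file, wait long enough for progress to be expected, capture a second file, and then run stalled_recoveries on the pair. Any line printed names a shard worth checking against throttling, disk, and allocation signals.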

Inspect detailed recovery metrics for the affected index when the fast view is not enough.

$ curl -sS --fail "http://localhost:9200/logs-2026.04/_recovery?active_only=true&detailed=true&human&pretty"
{
  "logs-2026.04" : {
    "shards" : [
      {
        "id" : 0,
        "type" : "PEER",
        "stage" : "TRANSLOG",
        "primary" : false,
        "source" : {
          "name" : "es-hot-1"
        },
        "target" : {
          "name" : "es-hot-2"
        },
        "index" : {
          "files" : {
            "total" : 152,
            "recovered" : 152,
            "percent" : "100.0%"
          },
          "size" : {
            "total" : "100mb",
            "total_in_bytes" : 104857600,
            "recovered" : "100mb",
            "recovered_in_bytes" : 104857600,
            "percent" : "100.0%"
          },
          "source_throttle_time" : "0s",
          "target_throttle_time" : "0s"
        },
        "translog" : {
          "recovered" : 489,
          "total" : 1000,
          "percent" : "48.9%",
          "total_time" : "4.2s"
        },
        "verify_index" : {
          "check_index_time" : "0s"
        }
      }
##### snipped #####
    ]
  }
}

The index recovery API accepts active_only=true and detailed=true together. The INDEX, VERIFY_INDEX, TRANSLOG, and DONE stages separate file copy, validation, translog replay, and completion, so the reported stage shows which phase a slow recovery is in.
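The byte counters in the detailed output can be turned into a rough time-remaining figure for the file-copy phase. This is a back-of-the-envelope sketch, not an Elasticsearch feature: the function name recovery_eta_seconds is hypothetical, the elapsed time must be supplied by the caller, and the estimate assumes the average copy rate so far continues, which throttling and segment sizes often prevent.

```shell
# Hypothetical helper: recovery_eta_seconds TOTAL_BYTES RECOVERED_BYTES ELAPSED_SECONDS
# Prints the estimated seconds remaining, or "unknown" when no rate can be derived.
recovery_eta_seconds() {
  awk -v total="$1" -v recovered="$2" -v elapsed="$3" 'BEGIN {
    if (recovered <= 0 || elapsed <= 0) { print "unknown"; exit }
    # remaining bytes divided by the average rate (recovered / elapsed)
    printf "%.0f\n", (total - recovered) * elapsed / recovered
  }'
}
```

For example, a shard with total_in_bytes of 104857600 that has recovered 26214400 bytes in 10 seconds would be estimated at 30 more seconds.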

Check the affected shards for their final state and any unassigned reason that still needs attention.

$ curl -sS --fail "http://localhost:9200/_cat/shards/logs-2026.04?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=shard,prirep"
index        shard prirep state   node      unassigned.reason
logs-2026.04 0     p      STARTED es-hot-1
logs-2026.04 0     r      STARTED es-hot-2
logs-2026.04 1     p      STARTED es-hot-1
logs-2026.04 1     r      STARTED es-hot-3

The unassigned.reason column stays empty for copies that have started, as in the output above. Focus first on replicas or primaries that remain UNASSIGNED, INITIALIZING, or RELOCATING after the rest of the index has settled.
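On an index with many shards, filtering out the STARTED rows leaves only the copies that still need attention. The helper below is a hypothetical sketch that assumes the column order from the command above (index, shard, prirep, state, node, unassigned.reason). The name unsettled_shards is illustrative.

```shell
# Hypothetical helper: read `_cat/shards?v=true` output on stdin and print
# only shard copies whose state is not STARTED.
unsettled_shards() {
  awk '
    NR == 1 { next }                      # skip the header row
    $4 != "STARTED" {
      # UNASSIGNED rows have an empty node column, so with whitespace
      # splitting the last field is the unassigned reason; for other
      # states the last field is a node name.
      detail = (NF >= 5) ? $NF : "-"
      print $1, $2, $3, $4, detail
    }
  '
}
```

Empty output from this filter means every listed copy has reached STARTED.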

Wait for the cluster to report no relocating or initializing shards before treating recovery as finished.

$ curl -sS --fail "http://localhost:9200/_cluster/health?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=60s&filter_path=cluster_name,status,relocating_shards,initializing_shards,unassigned_shards,active_shards_percent_as_number,timed_out&pretty"
{
  "cluster_name" : "es-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,
  "active_shards_percent_as_number" : 99.2
}

The cluster health API accepts wait_for_no_relocating_shards and wait_for_no_initializing_shards together with a timeout. A response with timed_out set to false and both counters at zero means recovery work has finished. A yellow status can still be acceptable when recovery is complete but one or more replicas remain intentionally unavailable or cannot be placed.
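In a script, the same response can be reduced to a short verdict. The sketch below is hypothetical and deliberately simple: json_field and recovery_wait_verdict are illustrative names, and the sed-based extraction only suits the flat, filtered response shown above. A JSON tool such as jq would be more robust where it is available.

```shell
# Hypothetical helper: json_field NAME JSON_TEXT
# Prints the first bare (number or boolean) value of NAME in flat JSON text.
json_field() {
  printf '%s\n' "$2" |
    sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\([a-z0-9.]*\).*/\1/p" |
    head -n 1
}

# Hypothetical helper: read the filtered cluster health JSON on stdin and
# print whether recovery has settled.
recovery_wait_verdict() {
  local json timed_out relo init unassigned
  json=$(cat)
  timed_out=$(json_field timed_out "$json")
  relo=$(json_field relocating_shards "$json")
  init=$(json_field initializing_shards "$json")
  unassigned=$(json_field unassigned_shards "$json")
  if [ "$timed_out" = "true" ] || [ "${relo:-0}" -gt 0 ] || [ "${init:-0}" -gt 0 ]; then
    echo "still recovering"
  elif [ "${unassigned:-0}" -gt 0 ]; then
    echo "settled with $unassigned unassigned"
  else
    echo "settled"
  fi
}
```

Piping the wait request above into recovery_wait_verdict would print "settled with 2 unassigned" for the sample response, which is the cue to review the remaining replicas rather than to keep waiting.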