Scaling the LGTM stack for high availability means giving Grafana, Loki, Tempo, and Mimir enough replicas and placement rules to survive routine pod or node loss. A production stack should keep ingesting and querying telemetry while one pod is restarted or one node is drained.

Backend availability depends on more than replica counts. Loki, Tempo, and Mimir also need shared object storage, healthy memberlist and hash rings, pod anti-affinity, and resource requests that match the deployment mode selected in their Helm charts.

Validate scaling changes in staging before applying them to production. A useful HA check compares readiness and smoke queries before and after restarting one non-critical pod, then confirms the component returns to the expected replica count.
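
The before-and-after comparison can be sketched as a small shell helper. This is a minimal sketch, not part of kubectl or of any chart: the function names readiness_snapshot and same_readiness are hypothetical, and the input is assumed to be the text printed by kubectl get statefulset,deployment --no-headers, where the first two columns are NAME and READY.

```shell
# Sketch only: compare two snapshots of workload readiness.
# Assumption: each snapshot holds rows of "NAME READY ..." text.
# Both function names are hypothetical helpers for illustration.

# Print sorted "NAME READY" pairs so two snapshots compare line by line.
# Blank lines and extra columns such as AGE are ignored.
readiness_snapshot() {
  awk 'NF >= 2 { print $1, $2 }' | sort
}

# Succeed (exit 0) only when both snapshots list the same workloads
# with the same READY values.
same_readiness() {
  [ "$(printf '%s\n' "$1" | readiness_snapshot)" = "$(printf '%s\n' "$2" | readiness_snapshot)" ]
}
```

One possible use around a disruption test, taking the first snapshot before the restart and the second after the rollout completes:

    $ before=$(kubectl get statefulset,deployment --namespace monitoring --no-headers)
    $ after=$(kubectl get statefulset,deployment --namespace monitoring --no-headers)
    $ same_readiness "$before" "$after" && echo "readiness unchanged"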

Steps to scale the LGTM stack for high availability:

  1. Check current pod placement.
    $ kubectl get pods --namespace monitoring -o wide
    NAME                                  READY   STATUS    NODE
    grafana-7f9df8f8c7-n2c6x             1/1     Running   worker-a
    loki-write-0                         2/2     Running   worker-a
    loki-write-1                         2/2     Running   worker-b
    mimir-ingester-zone-a-0              1/1     Running   worker-a
    mimir-ingester-zone-b-0              1/1     Running   worker-b
    ##### snipped #####
  2. Set production replica counts in the stack values.
    values-ha.yaml
    grafana:
      replicas: 2
    
    loki:
      deploymentMode: SimpleScalable
      read:
        replicas: 3
      write:
        replicas: 3
      backend:
        replicas: 3
    
    tempo:
      distributor:
        replicas: 2
      ingester:
        replicas: 3
      querier:
        replicas: 2
      queryFrontend:
        replicas: 2
    
    mimir:
      distributor:
        replicas: 2
      ingester:
        replicas: 3
      querier:
        replicas: 2
      query_frontend:
        replicas: 2
      store_gateway:
        replicas: 3

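    Use the keys supported by the chart versions installed in the cluster; key names differ between charts, for example queryFrontend in the Tempo values and query_frontend in the Mimir values above. The top-level grafana, loki, tempo, and mimir keys assume an umbrella chart with subcharts of those names. When each component is installed as its own release, as in steps 4 and 5, Helm expects that component's settings at the top level of the values file instead. Render the chart before applying the values.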

  3. Add pod anti-affinity or topology spread settings.
    values-spread.yaml
    global:
      podAntiAffinity:
        enabled: true
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
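
    With maxSkew: 1 and a hostname topology key, the scheduler tries to keep the pod count on any node within one of the least-loaded eligible node, and ScheduleAnyway makes that a preference instead of a hard rule. The effect can be checked from a placement listing. The following is a minimal sketch: the function name node_skew is hypothetical, and the input is assumed to be two columns, pod name and node name, such as the output of kubectl get pods with -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName --no-headers.

    ```shell
    # Sketch only: report the hostname skew for one component.
    # $1   pod name prefix, for example loki-write
    # $2.. optional node names, so nodes hosting no matching pod count as 0
    # stdin: rows of "POD_NAME NODE_NAME" (assumed input format)
    # The function name node_skew is a hypothetical helper.
    node_skew() {
      prefix=$1
      shift
      awk -v prefix="$prefix" -v nodes="$*" '
        BEGIN {
          # Seed every known node with zero pods.
          n = split(nodes, names, " ")
          for (i = 1; i <= n; i++) count[names[i]] = 0
        }
        # Count pods whose name starts with the prefix, per node.
        index($1, prefix) == 1 { count[$2]++ }
        END {
          max = 0
          min = -1
          for (node in count) {
            if (count[node] > max) max = count[node]
            if (min < 0 || count[node] < min) min = count[node]
          }
          if (min < 0) min = 0
          # Skew is the gap between the fullest and emptiest node.
          print max - min
        }'
    }
    ```

    A result above 1 suggests the constraint was not satisfied when the pods were scheduled. Without the optional node names the function only sees nodes that already host a matching pod, so it can understate the skew.
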
  4. Render the chart values before changing the cluster.
    $ helm template loki grafana/loki \
      --namespace monitoring \
      --values values/loki.yaml \
      --values values-ha.yaml \
      --values values-spread.yaml
    ##### snipped #####
    kind: StatefulSet
    metadata:
      name: loki-write
  5. Apply the HA values to each release.
    $ helm upgrade --install loki grafana/loki \
      --namespace monitoring \
      --values values/loki.yaml \
      --values values-ha.yaml \
      --values values-spread.yaml \
      --wait --timeout 15m
    Release "loki" has been upgraded. Happy Helming!
  6. Wait for every pod in the namespace to report Ready.
    $ kubectl wait --namespace monitoring \
      --for=condition=Ready pod \
      --all --timeout=10m
    pod/grafana-7f9df8f8c7-n2c6x condition met
    pod/loki-write-0 condition met
    ##### snipped #####
  7. Check that scaled components have the expected replica counts.
    $ kubectl get statefulset,deployment --namespace monitoring
    NAME                            READY
    statefulset.apps/loki-write     3/3
    statefulset.apps/loki-backend   3/3
    statefulset.apps/mimir-ingester 3/3
    deployment.apps/grafana         2/2
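
    The READY column check in this step can also be scripted. This is a minimal sketch, assuming input rows shaped like the listing above, with NAME first and READY second in ready/desired form; the function name not_ready_workloads is hypothetical.

    ```shell
    # Sketch only: print workloads whose ready count differs from the desired count.
    # stdin: rows of "NAME READY ..." where READY looks like 2/3 (assumed format).
    # Header rows and blank lines are skipped because they lack a ready/desired pair.
    # The function name not_ready_workloads is a hypothetical helper.
    not_ready_workloads() {
      awk 'NF >= 2 && $2 ~ "^[0-9]+/[0-9]+$" {
        split($2, r, "/")
        if (r[1] != r[2]) print $1
      }'
    }
    ```

    Piping the command from this step into the function prints nothing when every listed workload has all replicas ready:

    $ kubectl get statefulset,deployment --namespace monitoring | not_ready_workloads
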
  8. Run a smoke query before restarting a pod.
    $ curl --silent https://grafana.example.com/api/health
    {"database":"ok","version":"13.0.1"}
  9. Restart one backend workload through its owning controller, which replaces its pods one at a time.
    $ kubectl rollout restart statefulset/loki-write \
      --namespace monitoring
    statefulset.apps/loki-write restarted

    Run disruption tests in staging first. Do not delete multiple ring members at once unless the component documentation and current ring state support it.

  10. Confirm readiness returns after the restart.
    $ kubectl rollout status statefulset/loki-write \
      --namespace monitoring --timeout=10m
    statefulset rolling update complete 3 pods at revision loki-write-7c7d9d
  11. Run a Loki smoke query after the restart, then repeat the Grafana health check from step 8 and compare the results.
    $ curl --silent --get https://logs.example.com/loki/api/v1/query_range \
      --data-urlencode 'query={service_name="checkout-api"}'
    {"status":"success","data":{"resultType":"streams","result":[]}}
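
    The response in this step reports success with an empty result list, which shows that the query path answers but not that matching logs exist. A scripted check can separate the two cases. This is a minimal sketch, assuming the compact JSON shape shown above with no spaces after the colons; the function names loki_status, smoke_ok, and has_results are hypothetical, and a JSON parser such as jq would be more robust where it is installed.

    ```shell
    # Sketch only: inspect a Loki query response read from stdin.
    # Assumption: compact JSON as in the sample response, for example
    # {"status":"success","data":{"resultType":"streams","result":[]}}
    # All three function names are hypothetical helpers.

    # Print the value of the "status" field.
    loki_status() {
      sed -n 's/.*"status":"\([^"]*\)".*/\1/p' | head -n 1
    }

    # Succeed only when the response status is "success".
    smoke_ok() {
      [ "$(loki_status)" = "success" ]
    }

    # Succeed only when the result list is not empty.
    has_results() {
      ! grep -qF '"result":[]'
    }
    ```

    Saving the response to a shell variable before and after the restart allows both checks to run against each copy, for example printf '%s\n' "$response" | smoke_ok.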