How to monitor the LGTM stack itself

Monitoring the LGTM stack itself keeps the observability system from failing silently when ingestion, storage, queries, or alerting degrade. Grafana, Loki, Tempo, and Mimir expose metrics that should be scraped, queried, dashboarded, and alerted on like those of any other production service.

Self-monitoring should include both component metrics and synthetic telemetry. Component metrics show process health, rings, queues, and object storage errors, while synthetic logs, traces, and metrics prove that the end-to-end signal paths still work.

Keep the alerts focused on failures that require operator action. Readiness loss, failed remote write, object storage errors, ingestion drops, query failures, and notification delivery failures are stronger alert signals than raw pod restarts by themselves.

Steps to monitor the LGTM stack itself:

Confirm that scrape configuration exists for each component in the monitoring namespace.

$ kubectl get servicemonitor,podmonitor --namespace monitoring
NAME                                         AGE
servicemonitor.monitoring.coreos.com/grafana 12d
servicemonitor.monitoring.coreos.com/loki    12d
servicemonitor.monitoring.coreos.com/tempo   12d
servicemonitor.monitoring.coreos.com/mimir   12d

If the cluster does not use the Prometheus Operator, configure the equivalent scrape jobs in the active metrics collector.

Query component target health from the metrics backend.

$ curl --silent --get https://metrics.example.com/prometheus/api/v1/query \
  --data-urlencode 'query=up{namespace="monitoring"}'
{"status":"success","data":{"resultType":"vector","result":[]}}
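
An empty result vector, as in the response above, means that no series matched the selector. For the up query that usually points to missing scrape configuration or different labels, not to healthy targets. The following is a minimal sketch of a check that separates "nothing scraped" from "a target is down". PROM_URL, the namespace label value, and the function names are assumptions for illustration, and the empty-result test relies on the compact JSON shape shown above.

```shell
#!/usr/bin/env bash
# Sketch only: PROM_URL and the namespace label value are assumptions.
PROM_URL="${PROM_URL:-https://metrics.example.com/prometheus}"

# Run an instant query and print the raw JSON response.
prom_query() {
  curl --silent --fail --get "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Succeed when the JSON on stdin carries an empty result vector.
# Fixed-string match against compact JSON, so treat it as a rough check.
result_is_empty() {
  grep -qF '"result":[]'
}

# Hypothetical wrapper: fail when nothing is scraped or any target is down.
check_targets() {
  local all down
  all="$(prom_query 'up{namespace="monitoring"}')" || { echo "query failed"; return 2; }
  if printf '%s' "$all" | result_is_empty; then
    echo "no targets matched: check scrape configuration and labels"
    return 1
  fi
  down="$(prom_query 'up{namespace="monitoring"} == 0')" || { echo "query failed"; return 2; }
  if ! printf '%s' "$down" | result_is_empty; then
    echo "at least one target is down"
    return 1
  fi
  echo "all targets are up"
}
```

Source the file and run check_targets from a host that can reach the metrics backend. The same result_is_empty helper can be applied to the ingestion queries in the following steps.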

Check Loki ingestion metrics.

$ curl --silent --get https://metrics.example.com/prometheus/api/v1/query \
  --data-urlencode 'query=sum(rate(loki_distributor_lines_received_total[5m]))'
{"status":"success","data":{"resultType":"vector","result":[]}}

Check Tempo ingestion metrics.

$ curl --silent --get https://metrics.example.com/prometheus/api/v1/query \
  --data-urlencode 'query=sum(rate(tempo_receiver_accepted_spans[5m]))'
{"status":"success","data":{"resultType":"vector","result":[]}}

Check Mimir ingestion metrics.

$ curl --silent --get https://metrics.example.com/prometheus/api/v1/query \
  --data-urlencode 'query=sum(rate(cortex_distributor_received_samples_total[5m]))'
{"status":"success","data":{"resultType":"vector","result":[]}}

Mimir exposes many metrics with cortex_ prefixes for compatibility. Confirm exact metric names in the running version before final alert rollout.
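
One way to confirm metric names is to list the names stored in the backend through the Prometheus-compatible label values endpoint. The sketch below assumes the same base URL as the queries above; PROM_URL and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: PROM_URL is an assumption based on the queries above.
PROM_URL="${PROM_URL:-https://metrics.example.com/prometheus}"

# Build a series selector that matches every metric name with the given prefix.
name_matcher() {
  printf '{__name__=~"%s.*"}' "$1"
}

# List stored metric names that start with the prefix, as a JSON response.
list_metric_names() {
  curl --silent --fail --get "${PROM_URL}/api/v1/label/__name__/values" \
    --data-urlencode "match[]=$(name_matcher "$1")"
}
```

For example, list_metric_names cortex_distributor_ returns the distributor metric names that the running Mimir version actually reports, which can then be compared against the names used in alert expressions.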

Add alert rules for backend readiness and failed ingestion.

lgtm-alerts.yaml

apiVersion: 1
groups:
  - orgId: 1
    name: lgtm-stack-health
    folder: Observability
    interval: 1m
    rules:
      - uid: loki-no-ingest
        title: Loki is not receiving log lines
        condition: A
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: mimir
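            model:
              refId: A
              # "bool" makes the comparison return 1 (firing) or 0 (normal)
              # instead of filtering the series down to a zero-valued sample.
              expr: sum(rate(loki_distributor_lines_received_total[5m])) == bool 0
              instant: true
        # Treat a missing series as a failure; require 5 minutes before firing.
        noDataState: Alerting
        for: 5m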

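Provision the alert rules with the Grafana release. The grafana/grafana chart reads alert provisioning files from its alerting values key, so values/lgtm-alerts.yaml must nest the rule file under alerting (for example, under alerting.rules.yaml) rather than placing apiVersion and groups at the top level.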

$ helm upgrade --install grafana grafana/grafana \
  --namespace monitoring \
  --values values/grafana.yaml \
  --values values/lgtm-alerts.yaml \
  --wait
Release "grafana" has been upgraded. Happy Helming!

Confirm Grafana loaded the alert rule.

$ curl --silent --user admin:<password> \
  https://grafana.example.com/api/v1/provisioning/alert-rules
##### snipped #####
"title":"Loki is not receiving log lines"

Send synthetic telemetry through the production ingestion path.
```
$ curl --silent --include https://otel.example.com/v1/traces
HTTP/2 405
content-type: text/plain
```
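Use the organization's normal synthetic telemetry source when one exists. The 405 response is expected because OTLP/HTTP receivers accept only POST on this path, so a GET check only proves the receiver is reachable; a complete check should write a known signal and then query it back.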
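
Where no synthetic source exists, a write-and-query check for traces can be sketched as follows. It posts one span as OTLP/HTTP JSON and then polls the Tempo trace-by-ID endpoint. TEMPO_URL, the service name lgtm-synthetic, the retry timing, and the function names are assumptions; authentication and tenant headers are omitted and would need to be added for most deployments.

```shell
#!/usr/bin/env bash
# Sketch only: OTLP_URL matches the receiver above; TEMPO_URL and the
# service name "lgtm-synthetic" are assumptions.
OTLP_URL="${OTLP_URL:-https://otel.example.com}"
TEMPO_URL="${TEMPO_URL:-https://tempo.example.com}"

# Print N random bytes as lowercase hex (16 bytes for a trace ID, 8 for a span ID).
random_hex() {
  od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

# Print a minimal OTLP/HTTP JSON body holding one span.
otlp_payload() {
  local trace_id="$1" span_id="$2" start_ns="$3" end_ns="$4"
  cat <<EOF
{"resourceSpans":[{
  "resource":{"attributes":[
    {"key":"service.name","value":{"stringValue":"lgtm-synthetic"}}]},
  "scopeSpans":[{
    "scope":{"name":"synthetic-check"},
    "spans":[{
      "traceId":"${trace_id}","spanId":"${span_id}",
      "name":"synthetic-check","kind":1,
      "startTimeUnixNano":"${start_ns}","endTimeUnixNano":"${end_ns}"}]}]}]}
EOF
}

# Write one span, then poll Tempo until the trace is readable.
synthetic_trace_check() {
  local trace_id span_id now
  trace_id="$(random_hex 16)"
  span_id="$(random_hex 8)"
  now="$(date +%s)"
  otlp_payload "$trace_id" "$span_id" "${now}000000000" "${now}500000000" |
    curl --silent --fail "${OTLP_URL}/v1/traces" \
      --header 'Content-Type: application/json' --data-binary @- >/dev/null ||
    { echo "write failed"; return 1; }
  # Allow up to about 30 seconds for the span to become queryable.
  for _ in 1 2 3 4 5 6; do
    if curl --silent --fail --output /dev/null "${TEMPO_URL}/api/traces/${trace_id}"; then
      echo "trace ${trace_id} is queryable"
      return 0
    fi
    sleep 5
  done
  echo "trace ${trace_id} was written but not found"
  return 1
}
```

Running synthetic_trace_check on a schedule and alerting on a non-zero exit status covers the trace path end to end. Equivalent checks for logs and metrics would follow the same write-then-read pattern against Loki and Mimir.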

Check the stack dashboard after the synthetic run.

$ curl --silent --user admin:<password> \
  'https://grafana.example.com/api/search?query=LGTM'
##### snipped #####
"title":"LGTM Stack Health"