Monitoring the LGTM stack itself keeps the observability system from failing silently when ingestion, storage, queries, or alerting degrade. Grafana, Loki, Tempo, and Mimir expose metrics that should be scraped, queried, dashboarded, and alerted on like any other production service.
Self-monitoring should include both component metrics and synthetic telemetry. Component metrics show process health, rings, queues, and object storage errors, while synthetic logs, traces, and metrics prove that the end-to-end signal paths still work.
Keep the alerts focused on failures that require operator action. Readiness loss, failed remote write, object storage errors, ingestion drops, query failures, and notification delivery failures are stronger alert signals than raw pod restarts by themselves.
$ kubectl get servicemonitor,podmonitor --namespace monitoring
NAME                                           AGE
servicemonitor.monitoring.coreos.com/grafana   12d
servicemonitor.monitoring.coreos.com/loki      12d
servicemonitor.monitoring.coreos.com/tempo     12d
servicemonitor.monitoring.coreos.com/mimir     12d
If the cluster does not use the Prometheus Operator, configure the equivalent scrape jobs in the active metrics collector.
$ curl --silent --get https://metrics.example.com/prometheus/api/v1/query \
--data-urlencode 'query=up{namespace="monitoring"}'
{"status":"success","data":{"resultType":"vector","result":[]}}
$ curl --silent --get https://metrics.example.com/prometheus/api/v1/query \
--data-urlencode 'query=sum(rate(loki_distributor_lines_received_total[5m]))'
{"status":"success","data":{"resultType":"vector","result":[]}}
$ curl --silent --get https://metrics.example.com/prometheus/api/v1/query \
--data-urlencode 'query=sum(rate(tempo_receiver_accepted_spans[5m]))'
{"status":"success","data":{"resultType":"vector","result":[]}}
$ curl --silent --get https://metrics.example.com/prometheus/api/v1/query \
--data-urlencode 'query=sum(rate(cortex_distributor_received_samples_total[5m]))'
{"status":"success","data":{"resultType":"vector","result":[]}}
Mimir exposes many metrics with cortex_ prefixes for compatibility. Confirm exact metric names in the running version before final alert rollout.
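The responses above all carry an empty result vector. In the Prometheus HTTP API that means no series matched the query, which is different from a rate of zero. A small helper can flag that case so a missing scrape target or a renamed metric is not mistaken for a healthy check. The following is a minimal sketch: the function names has_series and check_query are hypothetical, and the string match assumes the compact JSON shown above rather than a full JSON parser.

```shell
#!/bin/sh
# Sketch only: has_series and check_query are illustrative names, not part of any tool.

# has_series RESPONSE
# Succeeds when an instant-query response reports success and holds at least one series.
# Assumes compact JSON, where an empty vector is serialized as "result":[].
has_series() {
  case "$1" in
    *'"status":"success"'*) ;;
    *) return 1 ;;
  esac
  case "$1" in
    *'"result":[]'*) return 1 ;;
    *) return 0 ;;
  esac
}

# check_query BASE_URL PROMQL
# Runs one instant query and reports whether it returned any series.
check_query() {
  response=$(curl --silent --get "$1/api/v1/query" --data-urlencode "query=$2")
  if has_series "$response"; then
    echo "ok: $2"
  else
    echo "empty or failed: $2"
    return 1
  fi
}

# Example call (not executed here):
#   check_query https://metrics.example.com/prometheus 'up{namespace="monitoring"}'
```

An empty vector for a metric such as cortex_distributor_received_samples_total is a prompt to check the scrape configuration and the metric name in the running version before building alerts on it.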
apiVersion: 1
groups:
  - orgId: 1
    name: lgtm-stack-health
    folder: Observability
    interval: 1m
    rules:
      - uid: loki-no-ingest
        title: Loki is not receiving log lines
        condition: A
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: mimir
            model:
              expr: sum(rate(loki_distributor_lines_received_total[5m])) == 0
$ helm upgrade --install grafana grafana/grafana \
--namespace monitoring \
--values values/grafana.yaml \
--values values/lgtm-alerts.yaml \
--wait
Release "grafana" has been upgraded. Happy Helming!
$ curl --silent --user admin:<password> \
https://grafana.example.com/api/v1/provisioning/alert-rules
##### snipped #####
"title":"Loki is not receiving log lines"
$ curl --silent --include https://otel.example.com/v1/traces
HTTP/2 405
content-type: text/plain
Use the organization's normal synthetic telemetry source when one exists. A 405 response to a GET only proves that the receiver is reachable and rejects methods other than POST; a complete check should write a known signal and then query it back.
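To go beyond a reachability check, a probe can POST a known span to the same OTLP/HTTP endpoint and later look it up by trace ID. The following is a minimal sketch, assuming the receiver accepts OTLP JSON at /v1/traces without authentication; the function names, the lgtm-synthetic-check service name, and the synthetic-check span name are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names and the service/span names are illustrative.

# random_hex BYTES
# Prints BYTES random bytes as lowercase hex (16 bytes for a trace ID, 8 for a span ID).
random_hex() {
  od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

# build_trace_payload TRACE_ID SPAN_ID START_NS END_NS
# Prints a single-span OTLP/HTTP JSON request body (span kind 1 is INTERNAL).
build_trace_payload() {
  printf '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"lgtm-synthetic-check"}}]},'
  printf '"scopeSpans":[{"spans":[{"traceId":"%s","spanId":"%s","name":"synthetic-check","kind":1,' "$1" "$2"
  printf '"startTimeUnixNano":"%s","endTimeUnixNano":"%s"}]}]}]}\n' "$3" "$4"
}

# send_synthetic_trace OTLP_BASE_URL
# Posts one span and prints its trace ID so a later query can look it up.
send_synthetic_trace() {
  trace_id=$(random_hex 16)
  span_id=$(random_hex 8)
  now=$(date +%s)
  # Second resolution padded to nanoseconds; the span covers the last second.
  start_ns="$((now - 1))000000000"
  end_ns="${now}000000000"
  build_trace_payload "$trace_id" "$span_id" "$start_ns" "$end_ns" |
    curl --silent --fail --request POST \
      --header 'Content-Type: application/json' \
      --data-binary @- "$1/v1/traces" >/dev/null &&
    echo "$trace_id"
}

# Example call (not executed here):
#   send_synthetic_trace https://otel.example.com
```

The printed trace ID is what the read side of the check would search for. With Tempo, that is typically a lookup against its trace-by-ID endpoint (/api/traces/<traceID>) after a short delay, with an alert if the trace never appears.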
$ curl --silent --user admin:<password> \
https://grafana.example.com/api/search?query=LGTM
##### snipped #####
"title":"LGTM Stack Health"