How to monitor a Kubernetes cluster in Checkmk

Monitoring a Kubernetes cluster in Checkmk connects the cluster API, the Checkmk Kubernetes collectors, and Checkmk service discovery so nodes, pods, workloads, and usage data appear in monitoring. The connection matters when Kubernetes objects change often enough that manual host creation cannot keep pace with the cluster.

Checkmk reads basic cluster state through the Kubernetes special agent, while the Checkmk Node Collector and Cluster Collector provide usage data such as CPU, memory, and filesystem metrics. The collectors are installed in the cluster with the official Helm chart, and Checkmk queries the API server plus the Cluster Collector endpoint from a dedicated cluster host.

A NodePort Cluster Collector endpoint is shown because it is easy to verify from the shell. Use an Ingress endpoint instead when that is the approved exposure path for the cluster, and keep the service account token and CA certificate out of screenshots, shell history, and saved troubleshooting notes.

Steps to monitor a Kubernetes cluster in Checkmk:

  1. Add the official Checkmk Helm repository.
    $ helm repo add checkmk-chart https://checkmk.github.io/checkmk_kube_agent
    "checkmk-chart" has been added to your repositories
  2. Check the current Checkmk Kubernetes chart metadata.
    $ helm show chart checkmk-chart/checkmk
    apiVersion: v2
    appVersion: 1.11.0
    description: Helm chart for Checkmk - Your complete IT monitoring solution
    icon: https://checkmk.com/application/files/thumbnails/low_res/9515/9834/3872/checkmk_icon_main.png
    kubeVersion: '>=1.19.0-0'
    name: checkmk
    type: application
    version: 1.11.0

    The current chart declares the Kubernetes version range it supports. Stop here if the cluster is older than the chart's kubeVersion value.
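
The version comparison in this step can be scripted. The following is a minimal sketch, assuming a `sort` that supports `-V` (GNU coreutils or BusyBox). The helper name `version_ge` and both version values are hypothetical: in practice the cluster version comes from `kubectl version` and the minimum from the chart's kubeVersion field, with the `>=` prefix and `-0` suffix stripped by hand.

```shell
#!/bin/sh
# Return success when version $1 is greater than or equal to version $2.
# Uses version sort: the lower of the two versions sorts first.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Hypothetical values for illustration only.
cluster_version="1.28.3"
chart_minimum="1.19.0"

if version_ge "$cluster_version" "$chart_minimum"; then
    echo "cluster $cluster_version satisfies >= $chart_minimum"
else
    echo "cluster $cluster_version is older than $chart_minimum"
fi
```

The sketch only handles a plain lower bound. A chart with a more complex kubeVersion constraint needs a proper semantic-version comparison.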

  3. Create a values.yaml file for the Cluster Collector endpoint.
    values.yaml
    clusterCollector:
      service:
        type: NodePort
        nodePort: 30035

    Use the chart's clusterCollector.ingress settings instead when the cluster exposes services through Ingress. Keep the same endpoint choice through the Checkmk rule so the monitoring server queries the reachable address.
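
A nodePort value outside the range the API server allows makes the install fail, so it can help to check the number before writing values.yaml. This is a minimal sketch, assuming the Kubernetes default range of 30000 to 32767; clusters whose API server runs with a custom `--service-node-port-range` need different bounds. The helper name `valid_nodeport` is hypothetical.

```shell
#!/bin/sh
# Return success when $1 is an integer inside the default NodePort range.
valid_nodeport() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}

if valid_nodeport 30035; then
    echo "30035 is inside the default NodePort range"
fi
```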

  4. Install the Checkmk collectors into the cluster.
    $ helm upgrade --install --create-namespace --namespace checkmk-monitoring myrelease checkmk-chart/checkmk -f values.yaml
    Release "myrelease" does not exist. Installing it now.
    NAME: myrelease
    NAMESPACE: checkmk-monitoring
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    NOTES:
    You can access the checkmk cluster-collector via:
    NodePort:
      http://10.0.12.40:30035

    The collectors run with permissions that let Checkmk read cluster, node, pod, and workload state. Install them in a dedicated namespace and review the chart values before applying them to a production cluster.

  5. Confirm that Helm reports the release as deployed.
    $ helm status --namespace checkmk-monitoring myrelease
    NAME: myrelease
    NAMESPACE: checkmk-monitoring
    STATUS: deployed
    REVISION: 1
  6. Check the Cluster Collector service endpoint.
    $ kubectl get service --namespace checkmk-monitoring myrelease-checkmk-cluster-collector
    NAME                                  TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
    myrelease-checkmk-cluster-collector   NodePort   10.96.178.75   <none>        8080:30035/TCP   2m

    For NodePort access, combine a reachable node address with the node port. For Ingress access, use the hostname or URL shown by the ingress controller.
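
Combining a node address with the node port can be sketched as below. The function names are hypothetical, and the two lookup functions are only defined, not called, because they need a reachable cluster. Taking the InternalIP of the first node is an assumption; the Checkmk server may need a different node or an external address.

```shell
#!/bin/sh
# Build the Cluster Collector URL from a node address ($1) and node port ($2).
collector_url() {
    printf 'http://%s:%s\n' "$1" "$2"
}

# Defined but not called here: read the node port from the service.
lookup_node_port() {
    kubectl get service --namespace checkmk-monitoring \
        myrelease-checkmk-cluster-collector \
        --output jsonpath='{.spec.ports[0].nodePort}'
}

# Defined but not called here: read the InternalIP of the first node.
lookup_node_address() {
    kubectl get nodes \
        --output jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}'
}

# Values taken from the sample output in this guide.
collector_url 10.0.12.40 30035
```

Against a live cluster, the same URL could be built with `collector_url "$(lookup_node_address)" "$(lookup_node_port)"`.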

  7. Store the service account token in a temporary shell variable.
    $ TOKEN=$(kubectl get secret myrelease-checkmk-checkmk --namespace checkmk-monitoring --output jsonpath='{.data.token}' | base64 --decode)

    Do not print real tokens into shared logs or screenshots. Copy the token directly into the Checkmk password store in the next Checkmk-side step.
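
The variable can be checked without echoing its value. Kubernetes service account tokens are JSON Web Tokens, which consist of three dot-separated segments, so a shape check catches an empty or truncated value. This is a minimal sketch with a hypothetical helper name and a placeholder string; it does not verify the signature or the expiry.

```shell
#!/bin/sh
# Return success when $1 has exactly three non-empty, dot-separated segments.
looks_like_jwt() {
    case "$1" in
        *.*.*.*) return 1 ;;    # more than three segments
        ?*.?*.?*) return 0 ;;   # three non-empty segments
        *) return 1 ;;          # empty or too few segments
    esac
}

# Placeholder string, not a real token.
sample="header.payload.signature"
if looks_like_jwt "$sample"; then
    echo "value has the shape of a JWT"
fi
```

With the variable from this step, `looks_like_jwt "$TOKEN"` reports the result through its exit status only, so the token never reaches the terminal.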

  8. Print the service account CA certificate for Checkmk import.
    $ kubectl get secret myrelease-checkmk-checkmk --namespace checkmk-monitoring --output jsonpath='{.data.ca\.crt}' | base64 --decode
    -----BEGIN CERTIFICATE-----
    MIIBdjCCAR2gAwIBAgIBADAKBggqhkjOPQQDAjAjMSEwHwYDVQQDDBhrM3Mtc2Vy
    ##### snipped #####
    -----END CERTIFICATE-----

    Copy the full certificate, including the BEGIN CERTIFICATE and END CERTIFICATE lines.
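
A quick way to confirm that both marker lines survived copying is to save the certificate to a file and check its first and last line. This is a minimal sketch; the helper name `is_pem_certificate` and the sample file content are hypothetical, and the check assumes a single certificate with no trailing blank lines or Windows line endings.

```shell
#!/bin/sh
# Return success when the file at $1 starts and ends with PEM certificate markers.
is_pem_certificate() {
    [ "$(head -n 1 "$1")" = "-----BEGIN CERTIFICATE-----" ] &&
        [ "$(tail -n 1 "$1")" = "-----END CERTIFICATE-----" ]
}

# Placeholder file for illustration; the middle line is not a real certificate.
sample=$(mktemp)
printf '%s\n' '-----BEGIN CERTIFICATE-----' 'c25pcHBlZA==' '-----END CERTIFICATE-----' > "$sample"
if is_pem_certificate "$sample"; then
    echo "both PEM markers are present"
fi
rm -f "$sample"
```

The marker check does not validate the certificate itself. Where OpenSSL is installed, `openssl x509 -in ca.crt -noout -subject -enddate` parses a saved file (here assumed to be named ca.crt) and fails on damaged content.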

  9. Test the Cluster Collector metadata endpoint.
    $ curl --header "Authorization: Bearer $TOKEN" http://10.0.12.40:30035/metadata
    {
      "cluster_collector_metadata": {
        "host_name": "myrelease-checkmk-cluster-collector-7d8c6f8b5d-lxq2m",
        "checkmk_kube_agent": {
          "project_version": "1.11.0"
        }
      }
    }

    Replace the URL with the NodePort or Ingress endpoint that the Checkmk server can reach.
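
For a scripted check, the collector version can be pulled out of the response. The following is a minimal sketch that uses `sed` only, with a hypothetical function name and a shortened sample of the response above. It assumes the key and its value sit on the same line; if several `project_version` keys share one line, it returns the last one, so a JSON-aware tool such as `jq` is the more robust choice where available.

```shell
#!/bin/sh
# Print the value of "project_version" from metadata JSON read on stdin.
extract_project_version() {
    sed -n 's/.*"project_version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Shortened sample of the metadata response shown in this step.
sample='{"cluster_collector_metadata": {"checkmk_kube_agent": {"project_version": "1.11.0"}}}'
printf '%s\n' "$sample" | extract_project_version
```

Piping the `curl` output from this step into `extract_project_version` gives the same result, and adding `--fail --silent --show-error` to `curl` makes an HTTP error surface as a non-zero exit status.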

  10. Store the token in Checkmk at Setup > General > Passwords > Add password.

    Use a title such as Kubernetes production token so the later Kubernetes rule can select the entry without exposing the token value.

  11. Import the CA certificate in Checkmk at Setup > General > Global settings > Site management > Trusted certificate authorities for SSL.

    The Kubernetes rule can then use certificate verification instead of disabling TLS checks for the API server.

  12. Create a cluster host in Checkmk with No IP as the IP address family.

    The cluster host receives the special-agent and piggyback data at cluster level; it is not a host that Checkmk should ping directly.
    Related: How to create a Checkmk piggyback host

  13. Configure dynamic host management for Kubernetes piggyback hosts when the site supports it.

    In commercial editions, create a connection under Setup > Hosts > Dynamic host management and restrict the source host to the cluster host. In Checkmk Community, use the piggyback orphan list and create the Kubernetes object hosts manually.

  14. Create the Kubernetes special agent rule at Setup > Agents > VM, cloud, container > Kubernetes.

    Set the cluster name, select the stored token, enter the Kubernetes API server endpoint, enable certificate verification, enable Enrich with usage data from Checkmk Cluster Collector, and enter the Cluster Collector endpoint.
    Related: How to create a Checkmk rule for selected hosts
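
The API server endpoint for the rule can be looked up from the kubeconfig and checked for an https scheme, which certificate verification depends on. This is a minimal sketch with hypothetical function names and a made-up endpoint. The lookup function is only defined, not called, and the URL it returns is the one the current kubeconfig context uses, which may differ from the address the Checkmk server can reach.

```shell
#!/bin/sh
# Return success when $1 is an https:// URL with a non-empty remainder.
is_https_url() {
    case "$1" in
        https://?*) return 0 ;;
        *) return 1 ;;
    esac
}

# Defined but not called here: print the API server URL of the current context.
current_api_server() {
    kubectl config view --minify --output jsonpath='{.clusters[0].cluster.server}'
}

# Hypothetical endpoint for illustration only.
endpoint="https://10.0.12.40:6443"
if is_https_url "$endpoint"; then
    echo "endpoint uses TLS: $endpoint"
fi
```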

  15. Restrict the Kubernetes special agent rule to the cluster host.

    Set Conditions > Explicit hosts to the cluster host. A broader condition can run the Kubernetes special agent for unrelated hosts and create confusing discovery results.

  16. Run service discovery on the cluster host.

    Accept the Kubernetes API and Cluster Collector services when discovery finds them.
    Related: How to run Checkmk service discovery

  17. Activate the pending Checkmk changes.

    Activation sends the saved host, rule, password-store, certificate, and discovery changes to monitoring.
    Related: How to activate Checkmk pending changes

  18. Check the cluster host services and Kubernetes dashboard.

    The cluster host should show Kubernetes API with Live, Ready in the summary, and Cluster Collector should show the collector version. In commercial editions, Monitor > Applications > Kubernetes should show CPU and memory resource data, and the Kubernetes Cluster dashboard should show Primary datasource, Cluster collector, and API health as OK.