Prometheus monitoring
Monitoring can be quite a complex task: it not only requires tools and connecting them, it also requires dashboards and sensible alert thresholds. This is very time consuming and requires a lot of experience.
Setup
For this the Kubernetes ecosystem has a ready-to-go package in the form of the "kube-prometheus-stack" helm chart.
Copy the following text into a file called helm_prometheus-stack.yaml.
prometheus:
  ingress:
    enabled: true
    hosts:
      - prometheus-127-0-0-1.nip.io
    tls:
      - hosts:
          - "prometheus-127-0-0-1.nip.io"
        secretName: prometheus-127-0-0-1-nip-io
    annotations:
      kubernetes.io/ingress.class: "nginx"
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
      cert-manager.io/cluster-issuer: "selfsigned-issuer"
  logLevel: debug
alertmanager:
  enabled: true
  ingress:
    enabled: true
    hosts:
      - alertmanager-127-0-0-1.nip.io
    tls:
      - hosts:
          - "alertmanager-127-0-0-1.nip.io"
        secretName: alertmanager-127-0-0-1-nip-io
    annotations:
      kubernetes.io/ingress.class: "nginx"
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
      cert-manager.io/cluster-issuer: "selfsigned-issuer"
  alertmanagerSpec:
    alertmanagerConfigSelector:
      matchExpressions:
        - key: alertmanagerConfig
          operator: In
          values:
            - thisnamehastomatch
grafana:
  persistence:
    enabled: true
  imageRenderer:
    enabled: true
  adminPassword: password123
  ingress:
    enabled: true
    hosts:
      - grafana-127-0-0-1.nip.io
    tls:
      - hosts:
          - "grafana-127-0-0-1.nip.io"
        secretName: grafana-127-0-0-1-nip-io
    annotations:
      kubernetes.io/ingress.class: "nginx"
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
      cert-manager.io/cluster-issuer: "selfsigned-issuer"
  env:
    GF_DATE_FORMATS_INTERVAL_DAY: "MMM YYYY"
Apply the chart with the following command:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install --namespace monitoring --create-namespace \
prometheus-operator prometheus-community/kube-prometheus-stack \
-f helm_prometheus-stack.yaml
After some time (you can check with kubectl get pods -n monitoring -w) everything is up and running. You can access the following web apps:
Prometheus: https://prometheus-127-0-0-1.nip.io
Alertmanager: https://alertmanager-127-0-0-1.nip.io
Grafana: https://grafana-127-0-0-1.nip.io
In addition to these apps a lot of dashboards have been deployed, as well as lots of Prometheus alerts.
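As a quick sanity check you can list the ingresses that were created for these web apps; a minimal sketch, with the hostnames following from the values file above:
kubectl get ingress -n monitoring
# note: the certificates come from the "selfsigned-issuer" configured above,
# so the browser will show a certificate warning on first access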
Play
The following section gives a short tour around the three tools mentioned above.
Prometheus
Prometheus is the standard monitoring tool, originally developed at SoundCloud. This video gives some nice impressions of the ecosystem. For a real deep dive, the videos from PromCon can be watched (2022, 2023).
In contrast to older monitoring systems, Prometheus pulls metrics from so-called "exporters". You can view the currently available exporters on your local setup. Note that some exporters don't work because the minikube setup differs from "normal" setups.
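Most of these exporters are scraped via ServiceMonitor objects created by the chart; a quick way to list them from the command line (a sketch, assuming the monitoring namespace from the install command above):
kubectl get servicemonitors -n monitoring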
The helm chart used also configures a lot of alerts; see the Alerts page of your local Prometheus. Take a look at the alert NodeCPUHighUsage. You can click on the expr to open the viewer. The expression you are clicking on is so-called "PromQL", the Prometheus query language, a dedicated query language for creating monitoring expressions. Modify the query slightly to show the overall CPU usage: sum(rate(node_cpu_seconds_total{job="node-exporter", mode!="idle"}[2m])) by (instance). This will show you the CPU load of your node.
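If you prefer the command line over the web UI, the same PromQL query can also be sent to the Prometheus HTTP API; a sketch, assuming the ingress host from the values file above and jq being installed:
curl -ks 'https://prometheus-127-0-0-1.nip.io/api/v1/query' \
  --data-urlencode 'query=sum(rate(node_cpu_seconds_total{job="node-exporter", mode!="idle"}[2m])) by (instance)' | jq .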
Copy the following into a file called deployment_highcpu.yaml. This will start a deployment that uses 1 CPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-ng
  labels:
    app: stress-ng
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress-ng
  template:
    metadata:
      labels:
        app: stress-ng
    spec:
      containers:
        - name: stress-ng
          image: alexeiled/stress-ng:latest-ubuntu
          imagePullPolicy: Always
          command: ["stress-ng"]
          args:
            - "--cpu"
            - "1"
Apply the yaml file as follows:
# apply the deployment
kubectl apply -n default -f deployment_highcpu.yaml
# remove the deployment again in case something goes wrong
kubectl delete deploy stress-ng -n default
Check the Prometheus graph. A high CPU load should be shown. An alert will only be triggered when the CPU load is above 90% for longer than 15 minutes.
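For a quick sanity check outside of Prometheus, kubectl top can show the pod's CPU usage; a sketch, assuming the metrics-server addon is enabled in minikube:
# optional: minikube addons enable metrics-server
kubectl top pod -n default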
To get higher CPU usage, scale up the deployment to start multiple containers. This can be done with the command kubectl scale deploy stress-ng -n default --replicas 8 to go for 8 pods. Note that when the deployment is changed, by default the old pods are terminated only AFTER the new pods have started (rolling update).
You can wait these 15 minutes for the alert to trigger... or continue right away. Don't forget to clean up (see above) in case you are worried about a hot laptop.
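Instead of waiting in the web UI, you can also poll the alert state from the command line via the Prometheus HTTP API; a sketch, again assuming the ingress host from the values file above and jq being installed:
curl -ks 'https://prometheus-127-0-0-1.nip.io/api/v1/alerts' | \
  jq '.data.alerts[] | select(.labels.alertname=="NodeCPUHighUsage")'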
Grafana
You can do the same (visualizing metrics) in Grafana as well, but with a nicer interface. Head over to the Explore section in Grafana to give it a try.
Besides the Explore feature, the helm chart has already deployed a couple of dashboards; see the dashboards overview in Grafana for the full list. See the node overview dashboard for an overview of the nodes. For minikube there is only one node by default.
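These dashboards are shipped as ConfigMaps that the Grafana sidecar picks up; assuming the chart's default sidecar label, you can list them with:
kubectl get configmaps -n monitoring -l grafana_dashboard=1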
To see some I/O load instead, edit the deployment to produce I/O load instead of CPU load: replace --cpu with --io using the command kubectl edit deploy -n default stress-ng.
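If you prefer a non-interactive change over kubectl edit, a JSON patch achieves the same; a sketch, where index 0 refers to the first args entry (--cpu) in the deployment above:
kubectl patch deploy stress-ng -n default --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args/0", "value": "--io"}]'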
Alertmanager
Where Grafana is mainly used for visualizing metrics and Prometheus for collecting them, Alertmanager is used to send alerts to various receivers. Copy and paste the following into a file called alertmanager_config.yaml. To make this actually work, you need to replace telegramtoken with an actual Telegram API token (base64-encoded, since it is stored under data: in a Secret) and replace the chatID the alerts are sent to. Alternatively, you can configure a different receiver.
---
apiVersion: v1
data:
  telegramtoken: xxxxxxtelegramxxxxtokenxxxx
kind: Secret
metadata:
  name: telegramtoken
type: Opaque
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-config
  labels:
    alertmanagerConfig: thisnamehastomatch
spec:
  route:
    groupBy: ['job','alertname']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'telegram'
    routes:
      - groupBy: ['job','alertname']
        groupWait: 30s
        groupInterval: 5m
        repeatInterval: 12h
        receiver: 'blackhole'
        matchers:
          - name: alertname
            value: InfoInhibitor
  receivers:
    - name: telegram
      telegramConfigs:
        - botToken:
            name: telegramtoken
            key: telegramtoken
          chatID: -12345678
          apiURL: "https://api.telegram.org"
          parseMode: "HTML"
    - name: blackhole
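As noted above, the value under data: in the Secret has to be base64-encoded; a sketch for generating it (the token shown is only a placeholder):
echo -n '123456:replace-with-your-telegram-bot-token' | base64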
This config needs to be applied in every namespace where you want to have alerting, for example with the following loop:
# alert configuration
NAMESPACES_TO_ALERT=( kube-system monitoring cert-manager ingress-nginx )
for i in "${NAMESPACES_TO_ALERT[@]}"; do
  kubectl apply -f alertmanager_config.yaml -n "$i"
done
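To verify that the configuration objects were created and can be picked up by Alertmanager, you can list them across all namespaces; a sketch:
kubectl get alertmanagerconfigs --all-namespaces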
Tear down
helm uninstall -n monitoring prometheus-operator
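helm uninstall does not remove the CRDs installed by the chart, nor the monitoring namespace itself; a sketch for a more complete cleanup (the exact CRD list depends on the chart version):
kubectl delete namespace monitoring
# list the remaining CRDs and delete them individually if you want them gone as well
kubectl get crd | grep monitoring.coreos.com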