
Rancher 2 managed Kubernetes node slow due to Prometheus / How to find the reason for a slow node and dynamically adjust resource limits

Published on September 23rd 2020


For the last couple of days we've witnessed an increasing number of slow nodes in one of our production Kubernetes clusters managed by Rancher 2. The workaround was always to cordon and drain the affected node and then uncordon the node again. But what is actually causing this slowness? And why only on a single node and not on all nodes of the cluster? We're about to find out, the troubleshooting way!
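
For reference, the workaround looked roughly like this (k8snode stands in for the affected node's name; the exact drain flags depend on the workloads running on the node):

> kubectl cordon k8snode
> kubectl drain k8snode --ignore-daemonsets --delete-local-data
> kubectl uncordon k8snode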

Monitoring alerts: Extremely high CPU IOWAIT

Our Icinga 2 monitoring alerted about very high IOWAITs on the CPU of one particular Kubernetes node. These IOWAITs started all of a sudden (at 7am), as can be nicely seen in the following graph:

High IOWAIT on Kubernetes node

On the node itself, the IOWAITs were confirmed with iostat:

root@k8snode:~# iostat 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.01    0.00    1.63   45.04    1.76   49.56

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
loop0             0.00         0.00         0.00          0          0
loop1             0.00         0.00         0.00          0          0
loop2             0.00         0.00         0.00          0          0
loop3             0.00         0.00         0.00          0          0
loop4             0.00         0.00         0.00          0          0
loop5             0.00         0.00         0.00          0          0
nvme0n1          83.00         0.00      1020.00          0       1020
nvme1n1         100.00       400.00        32.00        400         32
dm-0            132.00         0.00      1020.00          0       1020

After a hardware defect (a dying disk could be the reason for such IOWAIT spikes) was ruled out, it was time to find the container causing this.
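
How exactly a hardware defect is ruled out depends on the environment; on this NVMe-based node a quick check could look roughly like this (assuming smartmontools and nvme-cli are installed, device names taken from the iostat output above):

root@k8snode:~# smartctl -a /dev/nvme0n1    # SMART health status and error log
root@k8snode:~# nvme smart-log /dev/nvme0   # NVMe health info (media errors, wear level, temperature)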

Using docker tools to find the bad container

Obviously some container is using a lot of I/O. Using docker stats, all containers can be listed with their block I/O statistics:

root@k8snode:~# docker stats --format "table {{.Container}}\t{{.BlockIO}}" --no-stream --no-trunc
CONTAINER           BLOCK I/O
62c08d56d45d        262kB / 0B
82f35e25edf3        0B / 57.3kB
13fa7f80c99b        0B / 24.6kB
e445587401aa        0B / 847MB
04eb41b7a084        0B / 1.86GB
27eaa630ec84        0B / 0B
3341a7bea2d7        0B / 0B
21ac333b0e33        0B / 32.8kB
700aee319ee6        0B / 0B
e36a3d2dd55b        0B / 32.8kB
dafe40a694ed        0B / 32.8kB
293e4febf708        0B / 32.8kB
6c75c58a1830        0B / 0B
a27279e0681b        0B / 0B
7a1e141f83f9        0B / 0B
0ab7ce23d3ab        0B / 0B
7ae0a5533bfb        0B / 0B
0080651a7fc8        0B / 0B
5e49d58e98c7        0B / 0B
e0dcf15afd92        262kB / 0B
260a9dc57cda        61.5GB / 478MB
f60d8c63409b        0B / 0B
a428551fed97        0B / 0B
2ecb2bc61ab1        0B / 79.4MB
34a18c45cb50        0B / 0B

One container (260a9dc57cda) stands out with a total of 61.5 GB (!!!) read from and 478 MB written to disk. Let's get some additional stats about that container:

root@k8snode:~# docker stats 260a9dc57cda --format "table {{.Container}}\t{{.BlockIO}}\t{{.CPUPerc}}\t{{.MemPerc}}\t{{.MemUsage}}" --no-stream
CONTAINER           BLOCK I/O           CPU %               MEM %               MEM USAGE / LIMIT
260a9dc57cda        62.5GB / 478MB      0.58%               99.97%              999.7MiB / 1000MiB

And while we're at it, let's find out the container's (human-readable) name:

root@k8snode:~# docker stats 260a9dc57cda --format "table {{.Container}}\t{{.Name}}" --no-stream
CONTAINER           NAME
260a9dc57cda        k8s_prometheus_prometheus-cluster-monitoring-0_cattle-prometheus_ca3e7366-fc1f-11ea-acee-067cfe15a57a_1

So we've got two interesting facts here:

1) This is the container running Prometheus. In this Rancher 2 cluster, cluster monitoring is enabled, and when cluster monitoring is enabled, Rancher 2 fires up Prometheus in the background. So Prometheus is running on this particular node and is somehow causing problems.

2) The memory percentage column (MEM %) shows almost 100% memory used. According to the docker stats documentation, MemPerc means:

the percentage of the host’s CPU and memory the container is using

But the host itself still has plenty of memory available. This can be verified with free:

root@k8snode:~# free
              total        used        free      shared  buff/cache   available
Mem:       32525732     7304924    12375740        4940    12845068    26222492
Swap:       4194300     1069696     3124604

And systems monitoring also confirms there is still a lot of memory available:

Kubernetes node memory usage

What does the Rancher 2 user interface have to say?

Now that it is known that Prometheus (the Kubernetes monitoring enabled by Rancher 2) is causing problems, what does the Rancher 2 user interface have to say?

Interestingly the overview of the affected cluster already showed a red "Monitoring API is not ready", which kind of gives a hint:

Rancher 2 shows monitoring api is not ready

To view details, one can change into the "System" project. Within the "cattle-prometheus" namespace, the "prometheus-cluster-monitoring" workload is showing problems (although it is still marked as Active). To be more precise: the pod(s) of the workload are showing up as red (see the "Scale" column):

Rancher 2 System project: Prometheus Cluster Monitoring down

Swapping is slow

With the current findings we know that:

- The container causing the massive disk I/O is the Prometheus container (260a9dc57cda) of Rancher 2's cluster monitoring.
- According to docker stats, this container is using almost 100% of its memory.
- The node itself still has plenty of free memory available.

Using iotop on the node reveals a kworker flush process and our now familiar Prometheus container as the top I/O processes:

root@k8snode:~# iotop
Total DISK READ :       0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:     971.53 K/s
  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
 9854 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [kworker/u16:2+flush-259:1]
22841 be/4 ubuntu      0.00 B/s    0.00 B/s  0.00 % 27.46 % prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheu
s/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometheus --storage.tsdb.retention=12h --web.enable-lif
ecycle --storage.tsdb.no-lockfile --web.route-prefix=/ --web.listen-address=127.0.0.1:9090
 1243 be/4 root        0.00 B/s    0.00 B/s  0.00 %  1.49 % containerd
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init
[...]

Could this container be swapping? Using the process ID (22841), the process' memory consumption can be checked in /proc/22841/smaps. Summing up all the Swap entries reveals the full swap usage of this process:

root@k8snode:~# SWAPUSAGE=0; for SWAP in $(grep Swap /proc/22841/smaps|awk '{print $2}'); do let SWAPUSAGE=${SWAPUSAGE}+${SWAP}; done; echo $SWAPUSAGE
1992728
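
A shorter equivalent is to read the VmSwap value from the process status file or to sum the Swap entries with awk; both should report roughly the same value as the loop above:

root@k8snode:~# grep VmSwap /proc/22841/status
root@k8snode:~# awk '/^Swap:/ {sum += $2} END {print sum " kB"}' /proc/22841/smaps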

Yes, this process (and therefore this container) is definitely swapping. As it uses 100% of its memory (100% of what? We're about to find out later), memory is constantly being swapped out and read back in again - causing extremely high I/O. So far this explains the IOWAITs, but why is this happening at all?
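
The constant swap traffic can also be watched live with vmstat; with the container in this state, the si (swap-in) and so (swap-out) columns show sustained non-zero values:

root@k8snode:~# vmstat 1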

Is there a limit? Searching with docker tools...

Using docker stats above, we found out that the container is using 100% of its memory. But the node itself still has plenty of memory available. There is only one explanation left: this container has resource limits!

However, docker inspect didn't show any information related to limits. Maybe the container start command (docker run) would reveal more? To retrieve this information from an already running container, the runlike command is very helpful. It is a Python program which can be installed using pip:

root@k8snode:~# pip install runlike

Once installed, runlike expects the container ID as input. From the previous troubleshooting steps we already know that ID:

root@k8snode:~# runlike 260a9dc57cda
docker run --name=k8s_prometheus_prometheus-cluster-monitoring-0_cattle-prometheus_ca3e7366-fc1f-11ea-acee-067cfe15a57a_1 --hostname=prometheus-cluster-monitoring-0 --user=1000 --env=ACCESS_PROMETHEUS_PORT_80_TCP_PORT=80 --env=KUBERNETES_PORT_443_TCP_PROTO=tcp --env=ACCESS_GRAFANA_SERVICE_HOST=10.43.47.52 --env=KUBERNETES_SERVICE_PORT_HTTPS=443 --env=KUBERNETES_PORT=tcp://10.43.0.1:443 --env=KUBERNETES_PORT_443_TCP=tcp://10.43.0.1:443 --env=ACCESS_PROMETHEUS_SERVICE_HOST=10.43.132.44 --env=ACCESS_PROMETHEUS_SERVICE_PORT=80 --env=ACCESS_PROMETHEUS_SERVICE_PORT_HTTP=80 --env=ACCESS_PROMETHEUS_PORT_80_TCP_PROTO=tcp --env=KUBERNETES_SERVICE_HOST=10.43.0.1 --env=ACCESS_GRAFANA_PORT_80_TCP_PROTO=tcp --env=ACCESS_GRAFANA_PORT_80_TCP_PORT=80 --env=KUBERNETES_PORT_443_TCP_PORT=443 --env=KUBERNETES_PORT_443_TCP_ADDR=10.43.0.1 --env=ACCESS_GRAFANA_SERVICE_PORT_HTTP=80 --env=ACCESS_GRAFANA_PORT=tcp://10.43.47.52:80 --env=ACCESS_GRAFANA_PORT_80_TCP=tcp://10.43.47.52:80 --env=ACCESS_PROMETHEUS_PORT=tcp://10.43.132.44:80 --env=ACCESS_PROMETHEUS_PORT_80_TCP_ADDR=10.43.132.44 --env=KUBERNETES_SERVICE_PORT=443 --env=ACCESS_GRAFANA_SERVICE_PORT=80 --env=ACCESS_GRAFANA_PORT_80_TCP_ADDR=10.43.47.52 --env=ACCESS_PROMETHEUS_PORT_80_TCP=tcp://10.43.132.44:80 --env=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~empty-dir/config-out:/etc/prometheus/config_out:ro' --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~empty-dir/prometheus-cluster-monitoring-db:/prometheus' --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~configmap/prometheus-cluster-monitoring-rulefiles-0:/etc/prometheus/rules/prometheus-cluster-monitoring-rulefiles-0:ro' --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~secret/secret-exporter-etcd-cert:/etc/prometheus/secrets/exporter-etcd-cert:ro' --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~configmap/configmap-prometheus-cluster-monitoring-nginx:/etc/prometheus/configmaps/prometheus-cluster-monitoring-nginx:ro' --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~secret/cluster-monitoring-token-5zhd7:/var/run/secrets/kubernetes.io/serviceaccount:ro' --volume=/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/etc-hosts:/etc/hosts --volume=/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/containers/prometheus/415dc099:/dev/termination-log --volume=/prometheus --network=container:ac6587b4a86b605b7937586f70efbc9e2fedf6fae45ea91ff46a8a7a4a0d0973 --workdir=/prometheus --expose=9090/tcp --restart=no --label='maintainer=The Prometheus Authors ' --label='annotation.io.kubernetes.container.terminationMessagePath=/dev/termination-log' --label='io.kubernetes.container.logpath=/var/log/pods/cattle-prometheus_prometheus-cluster-monitoring-0_ca3e7366-fc1f-11ea-acee-067cfe15a57a/prometheus/1.log' --label='annotation.io.kubernetes.container.terminationMessagePolicy=File' --label='io.kubernetes.pod.namespace=cattle-prometheus' --label='io.kubernetes.pod.name=prometheus-cluster-monitoring-0' --label='annotation.io.kubernetes.pod.terminationGracePeriod=600' --label='annotation.io.kubernetes.container.restartCount=1' --label='annotation.io.kubernetes.container.ports=[{"name":"http","containerPort":80,"protocol":"TCP"}]' 
--label='io.kubernetes.sandbox.id=ac6587b4a86b605b7937586f70efbc9e2fedf6fae45ea91ff46a8a7a4a0d0973' --label='io.kubernetes.docker.type=container' --label='io.kubernetes.pod.uid=ca3e7366-fc1f-11ea-acee-067cfe15a57a' --label='io.kubernetes.container.name=prometheus' --label='annotation.io.kubernetes.container.hash=19c463dc' --log-opt max-size=100m --log-opt max-file=5 --detach=true sha256:690f4cf8dee25c239aa517a16dd73392ecd81485b29cad881d901a99b5b1a303 --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometheus --storage.tsdb.retention=12h --web.enable-lifecycle --storage.tsdb.no-lockfile --web.route-prefix=/ --web.listen-address=127.0.0.1:9090

But even in the original docker run command, no limits could be found. Is that surprising? Probably not, because Kubernetes is used as the orchestration layer to manage all these containers. So it's time to get into the Kubernetes cluster and search for answers there.

Troubleshooting using kubectl

First, let's find out what exactly is running inside the cattle-prometheus namespace (which we know from the Rancher 2 UI above):

> kubectl get all -n cattle-prometheus
NAME                                                          READY   STATUS    RESTARTS   AGE
pod/exporter-kube-state-cluster-monitoring-75567dff98-5nl8c   1/1     Running   0          11d
pod/exporter-node-cluster-monitoring-4clsp                    1/1     Running   0          14d
pod/exporter-node-cluster-monitoring-7mmtr                    1/1     Running   3          97d
pod/exporter-node-cluster-monitoring-bwv7w                    1/1     Running   0          97d
pod/exporter-node-cluster-monitoring-mzmfz                    1/1     Running   0          97d
pod/exporter-node-cluster-monitoring-tj2hd                    1/1     Running   1          97d
pod/exporter-node-cluster-monitoring-vhntm                    1/1     Running   0          97d
pod/exporter-node-cluster-monitoring-wgtbz                    1/1     Running   0          97d
pod/exporter-node-cluster-monitoring-wjxdw                    1/1     Running   0          97d
pod/grafana-cluster-monitoring-58459df585-96w4r               2/2     Running   0          39h
pod/prometheus-cluster-monitoring-0                           5/5     Running   31         39h
pod/prometheus-operator-monitoring-operator-c474f7475-8zsqq   1/1     Running   0          97d

NAME                                    TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
service/access-grafana                  ClusterIP   10.43.47.52    <none>        80/TCP              97d
service/access-prometheus               ClusterIP   10.43.132.44   <none>        80/TCP              97d
service/expose-grafana-metrics          ClusterIP   None           <none>        3000/TCP            97d
service/expose-kube-cm-metrics          ClusterIP   None           <none>        10252/TCP           97d
service/expose-kube-etcd-metrics        ClusterIP   None           <none>        2379/TCP            97d
service/expose-kube-scheduler-metrics   ClusterIP   None           <none>        10251/TCP           97d
service/expose-kubelets-metrics         ClusterIP   None           <none>        10250/TCP           97d
service/expose-kubernetes-metrics       ClusterIP   None           <none>        8080/TCP,8081/TCP   97d
service/expose-node-metrics             ClusterIP   None           <none>        9796/TCP            97d
service/expose-operator-metrics         ClusterIP   None           <none>        8080/TCP            97d
service/expose-prometheus-metrics       ClusterIP   None           <none>        9090/TCP            97d
service/prometheus-operated             ClusterIP   None           <none>        9090/TCP            97d

NAME                                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
daemonset.apps/exporter-node-cluster-monitoring   8         8         6       8            6           beta.kubernetes.io/os=linux   97d

NAME                                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/exporter-kube-state-cluster-monitoring    1/1     1            1           97d
deployment.apps/grafana-cluster-monitoring                1/1     1            1           97d
deployment.apps/prometheus-operator-monitoring-operator   1/1     1            1           97d

NAME                                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/exporter-kube-state-cluster-monitoring-75567dff98   1         1         1       97d
replicaset.apps/grafana-cluster-monitoring-576c5d879f               0         0         0       97d
replicaset.apps/grafana-cluster-monitoring-58459df585               1         1         1       39h
replicaset.apps/grafana-cluster-monitoring-664d8494c6               0         0         0       39h
replicaset.apps/prometheus-operator-monitoring-operator-c474f7475   1         1         1       97d

NAME                                             READY   AGE
statefulset.apps/prometheus-cluster-monitoring   1/1     97d

The kubectl get all command inside the cattle-prometheus namespace (-n cattle-prometheus) shows all the pods, services and the actual deployments. Most interesting here is pod/prometheus-cluster-monitoring-0: it was already restarted 31 times - definitely an eye-catching fact.
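
As a side note, pods with many restarts can also be spotted directly with kubectl by sorting on the restart count. A minimal sketch (it assumes the restart count of the first container status per pod is representative):

> kubectl get pods -n cattle-prometheus --sort-by='.status.containerStatuses[0].restartCount'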

Let's find out more about this specific pod:

> kubectl describe pods -n cattle-prometheus prometheus-cluster-monitoring-0
Name:               prometheus-cluster-monitoring-0
Namespace:          cattle-prometheus
Priority:           0
PriorityClassName: 
Node:               k8snode/10.11.3.61
Start Time:         Mon, 21 Sep 2020 15:33:25 +0000
Labels:             app=prometheus
                    chart=prometheus-0.0.1
                    controller-revision-hash=prometheus-cluster-monitoring-6ff5559496
                    monitoring.coreos.com=true
                    prometheus=cluster-monitoring
                    release=cluster-monitoring
                    statefulset.kubernetes.io/pod-name=prometheus-cluster-monitoring-0
Annotations:        cattle.io/timestamp: 2020-09-21T15:22:26Z
                    cni.projectcalico.org/podIP: 10.42.5.152/32
                    field.cattle.io/ports:
                      [[{"containerPort":80,"dnsName":"prometheus-cluster-monitoring-","name":"http","protocol":"TCP","sourcePort":0}],[{"containerPort":9090,"d...
Status:             Running
IP:                 10.42.5.152
Controlled By:      StatefulSet/prometheus-cluster-monitoring
Containers:
  prometheus:
    Container ID:  docker://260a9dc57cdaae59106fad9eabfd3a346641a71ee103453e4335afa46ada6706
    Image:         rancher/prom-prometheus:v2.7.1
    Image ID:      docker-pullable://rancher/prom-prometheus@sha256:b32864cfe5f7f5d146820ddc6dcc5f78f40d8ab6d55e80decd5fc222b16344dc
    Port:          80/TCP
    Host Port:     0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --storage.tsdb.retention=12h
      --web.enable-lifecycle
      --storage.tsdb.no-lockfile
      --web.route-prefix=/
      --web.listen-address=127.0.0.1:9090
    State:          Running
      Started:      Mon, 21 Sep 2020 15:33:39 +0000
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  1000Mi

    Requests:
      cpu:        750m
      memory:     750Mi
    Environment: 

[... cutting mounts and rest until events ...]

Events:
  Type     Reason     Age                     From                      Message
  ----     ------     ----                    ----                      -------
  Warning  Unhealthy  43m (x223 over 24h)     kubelet, onl-radoaie26-p  Liveness probe failed: Get http://10.42.5.152:9090/-/healthy: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff    9m13s (x126 over 103m)  kubelet, onl-radoaie26-p  Back-off restarting failed container
  Warning  Unhealthy  4m16s (x328 over 24h)   kubelet, onl-radoaie26-p  Readiness probe failed: Get http://10.42.5.152:9090/-/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

And at last, the limits were found! From the pod's description we can see that a CPU limit of 1 (which means Prometheus can use 100% of one CPU core of the node) and a memory limit of 1000Mi were defined. 1000Mi - haven't we seen this value before?

Memory usage: 999.7MiB / 1000MiB

Ah yes, we did! This finally explains why docker stats showed ~100% memory usage for this container: it has a 1000Mi memory limit and Prometheus inside it simply uses up everything. As this is an 8-node production Kubernetes cluster with quite a few workloads and pods deployed, it makes sense that Prometheus runs into this limit after a short period of time (39 hours according to the kubectl get all output above) and is simply overwhelmed by the amount of data coming in.

The solution? Increase that memory limit and give Prometheus more air (memory) to breathe!

Dynamically adjusting the memory limit of a workload/StatefulSet

In Kubernetes, limits are not defined on a single pod but on the controller object (a workload in Rancher 2 terms, here a StatefulSet): the controller's pod template tells Kubernetes which limits to apply to the pods that are part of that workload.

Now the big question is: Can we adjust these limits in Kubernetes? The answer is: YES (see this Stackoverflow question)! With kubectl edit an already existing/deployed object can be adjusted on the fly.

From the pod description above we can see which object is responsible for the pod (= manages the pod):

> kubectl describe pods -n cattle-prometheus prometheus-cluster-monitoring-0 | grep -i controlled
Controlled By:      StatefulSet/prometheus-cluster-monitoring

With this information, we can now run kubectl edit on StatefulSet/prometheus-cluster-monitoring:

Adjusting an already deployed Kubernetes deployment with kubectl edit

> kubectl edit StatefulSet/prometheus-cluster-monitoring -n cattle-prometheus

[...]
        resources:
          limits:
            cpu: "1"
            memory: 1000Mi
          requests:
            cpu: 750m
            memory: 750Mi
[...]

kubectl edit (on the correct target object) basically opens an editor where the relevant configuration can be adjusted. In this case the memory limit was increased from 1000Mi to 2000Mi and the change was saved.
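
The same change could also be applied non-interactively with a (strategic merge) patch. This is only a sketch of the equivalent command; the container name prometheus is taken from the pod description above:

> kubectl patch statefulset prometheus-cluster-monitoring -n cattle-prometheus \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"prometheus","resources":{"limits":{"memory":"2000Mi"}}}]}}}}'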

So far so good - but the Prometheus workload or at least the prometheus-cluster-monitoring-0 pod needs to be restarted.

How to force a new deployment/restart of a Kubernetes workload

The controller responsible for the pod prometheus-cluster-monitoring-0 (which we know is StatefulSet/prometheus-cluster-monitoring) can now be told to scale its number of pods down to zero by setting --replicas=0:

> kubectl scale StatefulSet/prometheus-cluster-monitoring -n cattle-prometheus --replicas=0
statefulset.apps/prometheus-cluster-monitoring scaled

Right after this command, the pods inside the cattle-prometheus namespace can be listed, which shows that the pod prometheus-cluster-monitoring-0 is terminating:

> kubectl get pods -n cattle-prometheus
NAME                                                      READY   STATUS        RESTARTS   AGE
exporter-kube-state-cluster-monitoring-75567dff98-5nl8c   1/1     Running       0          11d
exporter-node-cluster-monitoring-4clsp                    1/1     Running       0          14d
exporter-node-cluster-monitoring-7mmtr                    1/1     Running       3          97d
exporter-node-cluster-monitoring-bwv7w                    1/1     Running       0          97d
exporter-node-cluster-monitoring-mzmfz                    1/1     Running       0          97d
exporter-node-cluster-monitoring-tj2hd                    1/1     Running       1          97d
exporter-node-cluster-monitoring-vhntm                    1/1     Running       0          97d
exporter-node-cluster-monitoring-wgtbz                    1/1     Running       0          97d
exporter-node-cluster-monitoring-wjxdw                    1/1     Running       0          97d
grafana-cluster-monitoring-58459df585-96w4r               2/2     Running       0          40h
prometheus-cluster-monitoring-0                           4/5     Terminating   35         40h
prometheus-operator-monitoring-operator-c474f7475-8zsqq   1/1     Running       0          97d

A few seconds later, the pod is gone:

> kubectl get pods -n cattle-prometheus
NAME                                                      READY   STATUS    RESTARTS   AGE
exporter-kube-state-cluster-monitoring-75567dff98-5nl8c   1/1     Running   0          11d
exporter-node-cluster-monitoring-4clsp                    1/1     Running   0          14d
exporter-node-cluster-monitoring-7mmtr                    1/1     Running   3          97d
exporter-node-cluster-monitoring-bwv7w                    1/1     Running   0          97d
exporter-node-cluster-monitoring-mzmfz                    1/1     Running   0          97d
exporter-node-cluster-monitoring-tj2hd                    1/1     Running   1          97d
exporter-node-cluster-monitoring-vhntm                    1/1     Running   0          97d
exporter-node-cluster-monitoring-wgtbz                    1/1     Running   0          97d
exporter-node-cluster-monitoring-wjxdw                    1/1     Running   0          97d
grafana-cluster-monitoring-58459df585-96w4r               2/2     Running   0          40h
prometheus-operator-monitoring-operator-c474f7475-8zsqq   1/1     Running   0          97d

Now the StatefulSet can be scaled back to 1 replica:

> kubectl scale StatefulSet/prometheus-cluster-monitoring -n cattle-prometheus --replicas=1
statefulset.apps/prometheus-cluster-monitoring scaled

And the pod prometheus-cluster-monitoring-0 shows up again:

> kubectl get pods -n cattle-prometheus
NAME                                                      READY   STATUS    RESTARTS   AGE
exporter-kube-state-cluster-monitoring-75567dff98-5nl8c   1/1     Running   0          11d
exporter-node-cluster-monitoring-4clsp                    1/1     Running   0          14d
exporter-node-cluster-monitoring-7mmtr                    1/1     Running   3          97d
exporter-node-cluster-monitoring-bwv7w                    1/1     Running   0          97d
exporter-node-cluster-monitoring-mzmfz                    1/1     Running   0          97d
exporter-node-cluster-monitoring-tj2hd                    1/1     Running   1          97d
exporter-node-cluster-monitoring-vhntm                    1/1     Running   0          97d
exporter-node-cluster-monitoring-wgtbz                    1/1     Running   0          97d
exporter-node-cluster-monitoring-wjxdw                    1/1     Running   0          97d
grafana-cluster-monitoring-58459df585-96w4r               2/2     Running   0          40h
prometheus-cluster-monitoring-0                           5/5     Running   1          16s
prometheus-operator-monitoring-operator-c474f7475-8zsqq   1/1     Running   0          97d

Note: According to this article on Medium, Kubernetes 1.15 introduced the command kubectl rollout restart for exactly this purpose. But as this cluster still runs on Kubernetes 1.14, the command would not work and we had to go with the (re-)scale solution.
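
On Kubernetes 1.15 or newer, the restart would boil down to a single command (shown here only for reference, it does not work on this 1.14 cluster):

> kubectl rollout restart statefulset/prometheus-cluster-monitoring -n cattle-prometheus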

Now to the most important part: Did the rescaling/redeployment do the trick and apply the new memory limit? Let's verify with kubectl describe:

> kubectl describe pods -n cattle-prometheus prometheus-cluster-monitoring-0 | grep memory
      memory:  2000Mi
      memory:     750Mi
      memory:  50Mi
      memory:  50Mi
      memory:  100Mi
      memory:  50Mi
      memory:  200Mi
      memory:   100Mi
      memory:  25Mi
      memory:     25Mi

Yes, the very first memory limit (the one of the prometheus container) now shows up as 2000Mi (instead of 1000Mi)!
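
For a more targeted check, the limit of just the prometheus container can be read with a JSONPath query - a sketch, with the container name taken from the pod spec above:

> kubectl get pod prometheus-cluster-monitoring-0 -n cattle-prometheus \
  -o jsonpath='{.spec.containers[?(@.name=="prometheus")].resources.limits.memory}'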

Did it help?

A lot of troubleshooting, in-depth Docker and Kubernetes commands and finally a (dynamic) change of pod limits - but did this help? Immediately after the Prometheus pod was redeployed, iostat on the node confirmed that the IOWAITs were gone:

root@k8snode:~# iostat 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.74    0.00    1.12    0.00    1.75   91.40

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
loop0             0.00         0.00         0.00          0          0
loop1             0.00         0.00         0.00          0          0
loop2             0.00         0.00         0.00          0          0
loop3             0.00         0.00         0.00          0          0
loop4             0.00         0.00         0.00          0          0
loop5             0.00         0.00         0.00          0          0
nvme0n1           0.00         0.00         0.00          0          0
nvme1n1           5.00         0.00       108.00          0        108
dm-0              0.00         0.00         0.00          0          0

And the graphs from our monitoring confirmed it, too (shortly before 9:40 am):

In Rancher 2, the cluster overview started showing the cluster statistics (read from Prometheus) again:

And inside the "System" project, the prometheus-cluster-monitoring-0 pod inside the prometheus-cluster-monitoring workload shows up green again:

The new memory limits can also be seen in Rancher 2 when manually editing the prometheus-cluster-monitoring workload:

Will the new limit of 2000MB be enough for Prometheus in this particular cluster? Only time will tell...
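
To keep an eye on it, the actual memory usage of the pod's containers can be checked from time to time with kubectl top (assuming the cluster's metrics API is available):

> kubectl top pod prometheus-cluster-monitoring-0 -n cattle-prometheus --containers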

Don't forget to tell Rancher 2 about it, too!

Although the running prometheus-cluster-monitoring StatefulSet in the Kubernetes cluster is now set to a new limit of 2000Mi, the original monitoring settings in Rancher 2 are still the same. So if monitoring were disabled on the cluster level and then enabled again (or a newer version of the monitoring were deployed), the original settings (with a limit of 1000MB) would be applied again.

But the Rancher 2 default settings can be changed, too. Inside the cluster, click on Tools -> Monitoring.

Rancher 2 Tools Monitoring

Then increase the memory limit (Prometheus Memory Limit) in the form and save:

Rancher 2 Prometheus Monitoring Limit Settings