
Rancher 2 Kubernetes certificates expired! How to rotate your expired certificates?

Published on November 8th 2019


One of the lesser known aspects of Kubernetes is its X.509 certificates. These are used to secure communication between the cluster nodes (and also from "outside" to the cluster itself). When using Rancher 2 (2.0.x and 2.1.x) as the management layer to set up a Kubernetes cluster (using RKE), the self-signed certificates are created with a one-year validity.

In the Rancher documentation this is described as:

In Rancher v2.0.x and v2.1.x, the auto-generated certificates for Rancher-launched Kubernetes clusters have a validity period of one year, meaning these certificates will expire one year after the cluster is provisioned. The same applies to Kubernetes clusters provisioned by v0.1.x of the Rancher Kubernetes Engine (RKE) CLI.

Note: Kubernetes clusters can become quite complex! If the whole Kubernetes management, including upgrades, testing, etc. is too much hassle and effort for you and you just want to enjoy an available Kubernetes cluster to deploy your applications, check out the Private Kubernetes Cloud Infrastructure at Infiniroot!

Oh no, it's too late!

As life happens, these certificates are likely to be forgotten. And as they are self-signed, they are most likely not monitored either. The result? A broken Kubernetes cluster once the certificates expire. The check_rancher2 monitoring plugin detected that something was broken and alerted at 15:08:

Additional Info: CHECK_RANCHER2 CRITICAL - controller-manager in cluster "local" is not healthy -

Attempting to log in to Rancher 2 fails, and the user interface greets with a big error:

Rancher Kubernetes certificate expired

x509: certificate has expired or is not yet valid

Once all available cussing and bad words are used up, one self-reflects and knows whom to blame: ourselves. The one-year validity of the certificates was known, it even raised eyebrows during cluster creation, and yet here we are - with a broken cluster.

Kubernetes certificates

The certificates are stored locally on the etcd and control plane nodes. This means that in a proper Kubernetes cluster there are at least three nodes holding these certificates. They can be found on the local file system in /etc/kubernetes/.tmp/:

root@onl-ranx01-p:/etc/kubernetes/.tmp# ll
total 132
-rw-r--r-- 1 root root 3777 Nov  8  2018 cluster-state.yml
-rw-r--r-- 1 root root 1679 Nov  8  2018 kube-admin-key.pem
-rw-r--r-- 1 root root 1070 Nov  8  2018 kube-admin.pem
-rw-r--r-- 1 root root 1675 Nov  8  2018 kube-apiserver-key.pem
-rw-r--r-- 1 root root 1261 Nov  8  2018 kube-apiserver.pem
-rw-r--r-- 1 root root 1679 Nov  8  2018 kube-apiserver-proxy-client-key.pem
-rw-r--r-- 1 root root 1107 Nov  8  2018 kube-apiserver-proxy-client.pem
-rw-r--r-- 1 root root 1679 Nov  8  2018 kube-apiserver-requestheader-ca-key.pem
-rw-r--r-- 1 root root 1082 Nov  8  2018 kube-apiserver-requestheader-ca.pem
-rw-r--r-- 1 root root 1675 Nov  8  2018 kube-ca-key.pem
-rw-r--r-- 1 root root 1017 Nov  8  2018 kube-ca.pem
-rw-r--r-- 1 root root 5387 Nov  8  2018 kubecfg-kube-admin.yaml
-rw-r--r-- 1 root root  517 Nov  8  2018 kubecfg-kube-apiserver-proxy-client.yaml
-rw-r--r-- 1 root root  533 Nov  8  2018 kubecfg-kube-apiserver-requestheader-ca.yaml
-rw-r--r-- 1 root root  501 Nov  8  2018 kubecfg-kube-controller-manager.yaml
-rw-r--r-- 1 root root  445 Nov  8  2018 kubecfg-kube-node.yaml
-rw-r--r-- 1 root root  449 Nov  8  2018 kubecfg-kube-proxy.yaml
-rw-r--r-- 1 root root  465 Nov  8  2018 kubecfg-kube-scheduler.yaml
-rw-r--r-- 1 root root 1675 Nov  8  2018 kube-controller-manager-key.pem
-rw-r--r-- 1 root root 1062 Nov  8  2018 kube-controller-manager.pem
-rw-r--r-- 1 root root 1679 Nov  8  2018 kube-etcd-10-10-1-127-key.pem
-rw-r--r-- 1 root root 1253 Nov  8  2018 kube-etcd-10-10-1-127.pem
-rw-r--r-- 1 root root 1675 Nov  8  2018 kube-etcd-10-10-1-208-key.pem
-rw-r--r-- 1 root root 1253 Nov  8  2018 kube-etcd-10-10-1-208.pem
-rw-r--r-- 1 root root 1675 Nov  8  2018 kube-etcd-10-10-2-6-key.pem
-rw-r--r-- 1 root root 1253 Nov  8  2018 kube-etcd-10-10-2-6.pem
-rw-r--r-- 1 root root 1679 Nov  8  2018 kube-node-key.pem
-rw-r--r-- 1 root root 1070 Nov  8  2018 kube-node.pem
-rw-r--r-- 1 root root 1675 Nov  8  2018 kube-proxy-key.pem
-rw-r--r-- 1 root root 1046 Nov  8  2018 kube-proxy.pem
-rw-r--r-- 1 root root 1675 Nov  8  2018 kube-scheduler-key.pem
-rw-r--r-- 1 root root 1050 Nov  8  2018 kube-scheduler.pem
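
To see exactly when each of these certificates expires, openssl can read the notAfter date out of every .pem file. A quick sketch - on a real node you would point CERT_DIR at /etc/kubernetes/.tmp; here a throwaway self-signed certificate is generated first so the loop can be demonstrated anywhere:

```shell
# Sketch: print the expiry date of every certificate in a directory.
# On an etcd/control plane node, set CERT_DIR=/etc/kubernetes/.tmp instead.
CERT_DIR=$(mktemp -d)

# Throwaway self-signed certificate (one-year validity) for the demo
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=kube-ca" \
  -keyout "$CERT_DIR/kube-ca-key.pem" \
  -out "$CERT_DIR/kube-ca.pem" 2>/dev/null

# Print "filename: notAfter=..." for each certificate (skip the key files)
for cert in "$CERT_DIR"/*.pem; do
  case "$cert" in *-key.pem) continue ;; esac
  printf '%s: %s\n' "$(basename "$cert")" \
    "$(openssl x509 -noout -enddate -in "$cert")"
done
```

Run against /etc/kubernetes/.tmp, every line would have shown a notAfter date in November 2019 - one year after provisioning.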

Quick verification: Yes, today is November 8th, 2019. Want more proof?

root@onl-ranx01-p:/etc/kubernetes/.tmp# stat kube-apiserver-requestheader-ca.pem
  File: 'kube-apiserver-requestheader-ca.pem'
  Size: 1082          Blocks: 8          IO Block: 4096   regular file
Device: 10302h/66306d    Inode: 535237      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-11-08 16:15:27.042564333 +0100
Modify: 2018-11-08 15:03:00.317899426 +0100
Change: 2018-11-08 15:03:00.317899426 +0100
 Birth: -

Indeed. The certificate was created on November 8th 2018 at 15:03 local time. We got the alert from our monitoring at 15:08 that something was wrong with the cluster. The good news: the monitoring plugin check_rancher2 works.
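
Of course, it would be nicer to be warned before the expiry date instead of five minutes after. openssl's -checkend option exits non-zero when a certificate will expire within a given number of seconds, which makes it trivial to wire into a cron job or monitoring check. A hedged sketch, using a deliberately short-lived throwaway certificate (the 30-day threshold is an arbitrary example):

```shell
# Sketch: warn when a certificate expires within the next 30 days.
# The demo certificate is valid for only 10 days, so the check below
# is expected to raise the warning.
CERT_DIR=$(mktemp -d)
CERT="$CERT_DIR/demo.pem"
openssl req -x509 -newkey rsa:2048 -nodes -days 10 \
  -subj "/CN=kube-ca" \
  -keyout "$CERT_DIR/demo-key.pem" -out "$CERT" 2>/dev/null

THRESHOLD=$((30 * 24 * 3600))   # 30 days in seconds
if openssl x509 -checkend "$THRESHOLD" -noout -in "$CERT" >/dev/null; then
  echo "OK - certificate valid for more than 30 days"
else
  echo "WARNING - certificate expires within 30 days"
fi
```

Pointed at the kube-ca.pem above, such a check would have fired a month before the cluster broke.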

Rotate the expired certificates

This is the Kubernetes cluster on which the Rancher management itself runs, which means this cluster was created using rke. Luckily, rke can be used again to rotate the certificates.
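
For this, rke needs the YAML config file that was used to provision the cluster. Such a file follows the usual RKE cluster.yml structure - a sketch with the three node addresses from this cluster, but hypothetical user, roles and SSH settings:

```yaml
# Sketch of an RKE cluster config (user, roles and ssh_key_path are
# assumptions for illustration); the real 3-node-rancher-cluster.yml
# must match what was used to provision the cluster.
nodes:
  - address: 10.10.1.127
    user: root
    role: [controlplane, worker, etcd]
  - address: 10.10.1.208
    user: root
    role: [controlplane, worker, etcd]
  - address: 10.10.2.6
    user: root
    role: [controlplane, worker, etcd]
ssh_key_path: ~/.ssh/id_rsa
```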

First, the newest rke version should be downloaded. As of this writing, this is 0.3.2:

root@linux:~# wget https://github.com/rancher/rke/releases/download/v0.3.2/rke_linux-amd64

To avoid version mix-ups in the future, we suggest renaming the rke_linux-amd64 binary:

root@linux:~# mv rke_linux-amd64 rke_linux-amd64-0.3.2
root@linux:~# chmod 700 rke_linux-amd64-0.3.2
root@linux:~# ll |grep rke
-rwx------ 1 root root 44530041 Oct 28 19:34 rke_linux-amd64-0.3.2
-rwx------ 1 root root 31767603 Jul  7  2018 rke_linux-amd64-1.8
-rwx------ 1 root root 31583460 Aug 10  2018 rke_linux-amd64-1.9

With the YAML config file of this cluster, rke can now be used to run the "cert rotate" command on the cluster. This produces a lot of output:

root@linux:~# ./rke_linux-amd64-0.3.2 cert rotate --config 3-node-rancher-cluster.yml
INFO[0000] Running RKE version: v0.3.2                 
INFO[0000] Initiating Kubernetes cluster               
INFO[0000] [state] Possible legacy cluster detected, trying to upgrade
INFO[0000] [reconcile] Rebuilding and updating local kube config
INFO[0000] Successfully Deployed local admin kubeconfig at [RANCHER_CLUSTER_PROD/kube_config_3-node-rancher-prod.yml]
INFO[0000] Successfully Deployed local admin kubeconfig at [RANCHER_CLUSTER_PROD/kube_config_3-node-rancher-prod.yml]
INFO[0000] Successfully Deployed local admin kubeconfig at [RANCHER_CLUSTER_PROD/kube_config_3-node-rancher-prod.yml]
INFO[0000] [state] Fetching cluster state from Kubernetes
INFO[0030] Timed out waiting for kubernetes cluster to get state
WARN[0030] Failed to fetch state from kubernetes: Timeout waiting for kubernetes cluster to get state
INFO[0030] [dialer] Setup tunnel for host [10.10.1.208]
INFO[0030] [dialer] Setup tunnel for host [10.10.1.127]
INFO[0030] [dialer] Setup tunnel for host [10.10.2.6]  
INFO[0030] [state] Fetching cluster state from Nodes   
INFO[0030] Checking if container [cluster-state-deployer] is running on host [10.10.1.127], try #1
INFO[0030] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127], try #1
INFO[0030] Image [rancher/rke-tools:v0.1.50] does not exist on host [10.10.1.127]: Error: No such image: rancher/rke-tools:v0.1.50
INFO[0030] Pulling image [rancher/rke-tools:v0.1.50] on host [10.10.1.127], try #1
INFO[0036] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127], try #1
INFO[0036] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127]
INFO[0037] Starting container [cluster-state-deployer] on host [10.10.1.127], try #1
INFO[0037] [state] Successfully started [cluster-state-deployer] container on host [10.10.1.127]
INFO[0037] [state] Successfully fetched cluster state from Nodes
INFO[0037] [certificates] Getting Cluster certificates from Kubernetes
WARN[0037] Failed to fetch certs from kubernetes: Get https://10.10.1.208:6443/api/v1/namespaces/kube-system/secrets/kube-ca?timeout=30s: x509: certificate has expired or is not yet valid
INFO[0037] [certificates] Fetching kubernetes certificates from nodes
INFO[0037] Checking if container [cert-fetcher] is running on host [10.10.1.127], try #1
INFO[0037] Checking if container [cert-fetcher] is running on host [10.10.1.127], try #1
INFO[0037] Checking if container [cert-fetcher] is running on host [10.10.1.127], try #1
[... the same line repeats while rke waits for the cert-fetcher container ...]
INFO[0042] Checking if container [cert-fetcher] is running on host [10.10.1.127], try #1
INFO[0042] Removing container [cert-fetcher] on host [10.10.1.127], try #1
INFO[0042] [certificates] Creating service account token key
INFO[0042] Successfully Deployed state file at [RANCHER_CLUSTER_PROD/3-node-rancher-prod.rkestate]
INFO[0042] Rotating Kubernetes cluster certificates    
INFO[0042] [certificates] Generating Kubernetes API server certificates
INFO[0042] [certificates] Generating Kube Controller certificates
INFO[0042] [certificates] Generating Kube Scheduler certificates
INFO[0042] [certificates] Generating Kube Proxy certificates
INFO[0043] [certificates] Generating Node certificate  
INFO[0043] [certificates] Generating admin certificates and kubeconfig
INFO[0043] [certificates] Generating Kubernetes API server proxy client certificates
INFO[0043] [certificates] Generating etcd-10.10.1.127 certificate and key
INFO[0043] [certificates] Generating etcd-10.10.2.6 certificate and key
INFO[0043] [certificates] Generating etcd-10.10.1.208 certificate and key
INFO[0043] Successfully Deployed state file at [RANCHER_CLUSTER_PROD/3-node-rancher-prod.rkestate]
INFO[0043] Rebuilding Kubernetes cluster with rotated certificates
INFO[0043] [dialer] Setup tunnel for host [10.10.1.208]
INFO[0043] [dialer] Setup tunnel for host [10.10.1.127]
INFO[0043] [dialer] Setup tunnel for host [10.10.2.6]  
INFO[0044] [certificates] Deploying kubernetes certificates to Cluster nodes
INFO[0044] Checking if container [cert-deployer] is running on host [10.10.1.127], try #1
INFO[0044] Checking if container [cert-deployer] is running on host [10.10.2.6], try #1
INFO[0044] Checking if container [cert-deployer] is running on host [10.10.1.208], try #1
INFO[0044] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208], try #1
INFO[0044] Image [rancher/rke-tools:v0.1.50] does not exist on host [10.10.1.208]: Error: No such image: rancher/rke-tools:v0.1.50
INFO[0044] Pulling image [rancher/rke-tools:v0.1.50] on host [10.10.1.208], try #1
INFO[0044] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6], try #1
INFO[0044] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127], try #1
INFO[0044] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127]
INFO[0044] Image [rancher/rke-tools:v0.1.50] does not exist on host [10.10.2.6]: Error: No such image: rancher/rke-tools:v0.1.50
INFO[0044] Pulling image [rancher/rke-tools:v0.1.50] on host [10.10.2.6], try #1
INFO[0044] Starting container [cert-deployer] on host [10.10.1.127], try #1
INFO[0044] Checking if container [cert-deployer] is running on host [10.10.1.127], try #1
INFO[0049] Checking if container [cert-deployer] is running on host [10.10.1.127], try #1
INFO[0049] Removing container [cert-deployer] on host [10.10.1.127], try #1
INFO[0050] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208], try #1
INFO[0050] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208]
INFO[0050] Starting container [cert-deployer] on host [10.10.1.208], try #1
INFO[0051] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6], try #1
INFO[0051] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6]
INFO[0051] Checking if container [cert-deployer] is running on host [10.10.1.208], try #1
INFO[0051] Starting container [cert-deployer] on host [10.10.2.6], try #1
INFO[0051] Checking if container [cert-deployer] is running on host [10.10.2.6], try #1
INFO[0056] Checking if container [cert-deployer] is running on host [10.10.1.208], try #1
INFO[0056] Removing container [cert-deployer] on host [10.10.1.208], try #1
INFO[0056] Checking if container [cert-deployer] is running on host [10.10.2.6], try #1
INFO[0056] Removing container [cert-deployer] on host [10.10.2.6], try #1
INFO[0056] [reconcile] Rebuilding and updating local kube config
INFO[0056] Successfully Deployed local admin kubeconfig at [RANCHER_CLUSTER_PROD/kube_config_3-node-rancher-prod.yml]
INFO[0056] Successfully Deployed local admin kubeconfig at [RANCHER_CLUSTER_PROD/kube_config_3-node-rancher-prod.yml]
INFO[0056] Successfully Deployed local admin kubeconfig at [RANCHER_CLUSTER_PROD/kube_config_3-node-rancher-prod.yml]
INFO[0056] [certificates] Successfully deployed kubernetes certificates to Cluster nodes
INFO[0056] Successfully Deployed state file at [RANCHER_CLUSTER_PROD/3-node-rancher-prod.rkestate]
INFO[0056] [etcd] Restarting up etcd plane..           
INFO[0056] Restarting container [etcd] on host [10.10.1.208], try #1
INFO[0056] Restarting container [etcd] on host [10.10.1.127], try #1
INFO[0056] Restarting container [etcd] on host [10.10.2.6], try #1
INFO[0062] [restart/etcd] Successfully restarted container on host [10.10.1.127]
INFO[0062] [restart/etcd] Successfully restarted container on host [10.10.1.208]
INFO[0062] [restart/etcd] Successfully restarted container on host [10.10.2.6]
INFO[0062] [etcd] Successfully restarted etcd plane..  
INFO[0062] [controlplane] Check if rotating a legacy cluster
INFO[0062] [controlplane] Redeploying controlplane to update kubeapi parameters
INFO[0062] [etcd] Building up etcd plane..             
INFO[0062] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127], try #1
INFO[0062] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127]
INFO[0062] Starting container [etcd-fix-perm] on host [10.10.1.127], try #1
INFO[0063] Successfully started [etcd-fix-perm] container on host [10.10.1.127]
INFO[0063] Waiting for [etcd-fix-perm] container to exit on host [10.10.1.127]
INFO[0063] Waiting for [etcd-fix-perm] container to exit on host [10.10.1.127]
INFO[0063] Container [etcd-fix-perm] is still running on host [10.10.1.127]
INFO[0064] Waiting for [etcd-fix-perm] container to exit on host [10.10.1.127]
INFO[0064] Checking if container [etcd] is running on host [10.10.1.127], try #1
INFO[0064] Checking if image [rancher/coreos-etcd:v3.3.10-rancher1] exists on host [10.10.1.127], try #1
INFO[0064] Image [rancher/coreos-etcd:v3.3.10-rancher1] does not exist on host [10.10.1.127]: Error: No such image: rancher/coreos-etcd:v3.3.10-rancher1
INFO[0064] Pulling image [rancher/coreos-etcd:v3.3.10-rancher1] on host [10.10.1.127], try #1
INFO[0069] Checking if image [rancher/coreos-etcd:v3.3.10-rancher1] exists on host [10.10.1.127], try #1
INFO[0069] Image [rancher/coreos-etcd:v3.3.10-rancher1] exists on host [10.10.1.127]
INFO[0069] Checking if container [old-etcd] is running on host [10.10.1.127], try #1
INFO[0069] Stopping container [etcd] on host [10.10.1.127] with stopTimeoutDuration [5s], try #1
INFO[0074] Waiting for [etcd] container to exit on host [10.10.1.127]
INFO[0074] Renaming container [etcd] to [old-etcd] on host [10.10.1.127], try #1
INFO[0074] Starting container [etcd] on host [10.10.1.127], try #1
INFO[0075] [etcd] Successfully updated [etcd] container on host [10.10.1.127]
INFO[0075] Removing container [old-etcd] on host [10.10.1.127], try #1
INFO[0075] [etcd] Running rolling snapshot container [etcd-snapshot-once] on host [10.10.1.127]
INFO[0075] Removing container [etcd-rolling-snapshots] on host [10.10.1.127], try #1
INFO[0075] [remove/etcd-rolling-snapshots] Successfully removed container on host [10.10.1.127]
INFO[0075] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127], try #1
INFO[0075] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127]
INFO[0075] Starting container [etcd-rolling-snapshots] on host [10.10.1.127], try #1
INFO[0075] [etcd] Successfully started [etcd-rolling-snapshots] container on host [10.10.1.127]
INFO[0080] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127], try #1
INFO[0080] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127]
INFO[0081] Starting container [rke-bundle-cert] on host [10.10.1.127], try #1
INFO[0081] [certificates] Successfully started [rke-bundle-cert] container on host [10.10.1.127]
INFO[0081] Waiting for [rke-bundle-cert] container to exit on host [10.10.1.127]
INFO[0081] Container [rke-bundle-cert] is still running on host [10.10.1.127]
INFO[0082] Waiting for [rke-bundle-cert] container to exit on host [10.10.1.127]
INFO[0082] [certificates] successfully saved certificate bundle [/opt/rke/etcd-snapshots//pki.bundle.tar.gz] on host [10.10.1.127]
INFO[0082] Removing container [rke-bundle-cert] on host [10.10.1.127], try #1
INFO[0082] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127], try #1
INFO[0082] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127]
INFO[0082] Starting container [rke-log-linker] on host [10.10.1.127], try #1
INFO[0082] [etcd] Successfully started [rke-log-linker] container on host [10.10.1.127]
INFO[0082] Removing container [rke-log-linker] on host [10.10.1.127], try #1
INFO[0083] [remove/rke-log-linker] Successfully removed container on host [10.10.1.127]
INFO[0083] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6], try #1
INFO[0083] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6]
INFO[0083] Starting container [etcd-fix-perm] on host [10.10.2.6], try #1
INFO[0083] Successfully started [etcd-fix-perm] container on host [10.10.2.6]
INFO[0083] Waiting for [etcd-fix-perm] container to exit on host [10.10.2.6]
INFO[0083] Waiting for [etcd-fix-perm] container to exit on host [10.10.2.6]
INFO[0083] Container [etcd-fix-perm] is still running on host [10.10.2.6]
INFO[0084] Waiting for [etcd-fix-perm] container to exit on host [10.10.2.6]
INFO[0084] Checking if container [etcd] is running on host [10.10.2.6], try #1
INFO[0084] Checking if image [rancher/coreos-etcd:v3.3.10-rancher1] exists on host [10.10.2.6], try #1
INFO[0084] Image [rancher/coreos-etcd:v3.3.10-rancher1] does not exist on host [10.10.2.6]: Error: No such image: rancher/coreos-etcd:v3.3.10-rancher1
INFO[0084] Pulling image [rancher/coreos-etcd:v3.3.10-rancher1] on host [10.10.2.6], try #1
INFO[0088] Checking if image [rancher/coreos-etcd:v3.3.10-rancher1] exists on host [10.10.2.6], try #1
INFO[0088] Image [rancher/coreos-etcd:v3.3.10-rancher1] exists on host [10.10.2.6]
INFO[0088] Checking if container [old-etcd] is running on host [10.10.2.6], try #1
INFO[0088] Stopping container [etcd] on host [10.10.2.6] with stopTimeoutDuration [5s], try #1
INFO[0093] Waiting for [etcd] container to exit on host [10.10.2.6]
INFO[0093] Renaming container [etcd] to [old-etcd] on host [10.10.2.6], try #1
INFO[0093] Starting container [etcd] on host [10.10.2.6], try #1
INFO[0093] [etcd] Successfully updated [etcd] container on host [10.10.2.6]
INFO[0093] Removing container [old-etcd] on host [10.10.2.6], try #1
INFO[0094] [etcd] Running rolling snapshot container [etcd-snapshot-once] on host [10.10.2.6]
INFO[0094] Removing container [etcd-rolling-snapshots] on host [10.10.2.6], try #1
INFO[0094] [remove/etcd-rolling-snapshots] Successfully removed container on host [10.10.2.6]
INFO[0094] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6], try #1
INFO[0094] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6]
INFO[0094] Starting container [etcd-rolling-snapshots] on host [10.10.2.6], try #1
INFO[0094] [etcd] Successfully started [etcd-rolling-snapshots] container on host [10.10.2.6]
INFO[0099] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6], try #1
INFO[0099] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6]
INFO[0099] Starting container [rke-bundle-cert] on host [10.10.2.6], try #1
INFO[0100] [certificates] Successfully started [rke-bundle-cert] container on host [10.10.2.6]
INFO[0100] Waiting for [rke-bundle-cert] container to exit on host [10.10.2.6]
INFO[0100] Container [rke-bundle-cert] is still running on host [10.10.2.6]
INFO[0101] Waiting for [rke-bundle-cert] container to exit on host [10.10.2.6]
INFO[0101] [certificates] successfully saved certificate bundle [/opt/rke/etcd-snapshots//pki.bundle.tar.gz] on host [10.10.2.6]
INFO[0101] Removing container [rke-bundle-cert] on host [10.10.2.6], try #1
INFO[0101] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6], try #1
INFO[0101] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6]
INFO[0101] Starting container [rke-log-linker] on host [10.10.2.6], try #1
INFO[0101] [etcd] Successfully started [rke-log-linker] container on host [10.10.2.6]
INFO[0101] Removing container [rke-log-linker] on host [10.10.2.6], try #1
INFO[0102] [remove/rke-log-linker] Successfully removed container on host [10.10.2.6]
INFO[0102] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208], try #1
INFO[0102] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208]
INFO[0102] Starting container [etcd-fix-perm] on host [10.10.1.208], try #1
INFO[0102] Successfully started [etcd-fix-perm] container on host [10.10.1.208]
INFO[0102] Waiting for [etcd-fix-perm] container to exit on host [10.10.1.208]
INFO[0102] Waiting for [etcd-fix-perm] container to exit on host [10.10.1.208]
INFO[0102] Container [etcd-fix-perm] is still running on host [10.10.1.208]
INFO[0103] Waiting for [etcd-fix-perm] container to exit on host [10.10.1.208]
INFO[0103] Checking if container [etcd] is running on host [10.10.1.208], try #1
INFO[0103] Checking if image [rancher/coreos-etcd:v3.3.10-rancher1] exists on host [10.10.1.208], try #1
INFO[0103] Image [rancher/coreos-etcd:v3.3.10-rancher1] does not exist on host [10.10.1.208]: Error: No such image: rancher/coreos-etcd:v3.3.10-rancher1
INFO[0103] Pulling image [rancher/coreos-etcd:v3.3.10-rancher1] on host [10.10.1.208], try #1
INFO[0107] Checking if image [rancher/coreos-etcd:v3.3.10-rancher1] exists on host [10.10.1.208], try #1
INFO[0107] Image [rancher/coreos-etcd:v3.3.10-rancher1] exists on host [10.10.1.208]
INFO[0107] Checking if container [old-etcd] is running on host [10.10.1.208], try #1
INFO[0107] Stopping container [etcd] on host [10.10.1.208] with stopTimeoutDuration [5s], try #1
INFO[0112] Waiting for [etcd] container to exit on host [10.10.1.208]
INFO[0112] Renaming container [etcd] to [old-etcd] on host [10.10.1.208], try #1
INFO[0112] Starting container [etcd] on host [10.10.1.208], try #1
INFO[0112] [etcd] Successfully updated [etcd] container on host [10.10.1.208]
INFO[0112] Removing container [old-etcd] on host [10.10.1.208], try #1
INFO[0112] [etcd] Running rolling snapshot container [etcd-snapshot-once] on host [10.10.1.208]
INFO[0112] Removing container [etcd-rolling-snapshots] on host [10.10.1.208], try #1
INFO[0112] [remove/etcd-rolling-snapshots] Successfully removed container on host [10.10.1.208]
INFO[0112] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208], try #1
INFO[0112] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208]
INFO[0113] Starting container [etcd-rolling-snapshots] on host [10.10.1.208], try #1
INFO[0113] [etcd] Successfully started [etcd-rolling-snapshots] container on host [10.10.1.208]
INFO[0118] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208], try #1
INFO[0118] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208]
INFO[0118] Starting container [rke-bundle-cert] on host [10.10.1.208], try #1
INFO[0119] [certificates] Successfully started [rke-bundle-cert] container on host [10.10.1.208]
INFO[0119] Waiting for [rke-bundle-cert] container to exit on host [10.10.1.208]
INFO[0119] Container [rke-bundle-cert] is still running on host [10.10.1.208]
INFO[0120] Waiting for [rke-bundle-cert] container to exit on host [10.10.1.208]
INFO[0120] [certificates] successfully saved certificate bundle [/opt/rke/etcd-snapshots//pki.bundle.tar.gz] on host [10.10.1.208]
INFO[0120] Removing container [rke-bundle-cert] on host [10.10.1.208], try #1
INFO[0120] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208], try #1
INFO[0120] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208]
INFO[0120] Starting container [rke-log-linker] on host [10.10.1.208], try #1
INFO[0120] [etcd] Successfully started [rke-log-linker] container on host [10.10.1.208]
INFO[0120] Removing container [rke-log-linker] on host [10.10.1.208], try #1
INFO[0120] [remove/rke-log-linker] Successfully removed container on host [10.10.1.208]
INFO[0120] [etcd] Successfully started etcd plane.. Checking etcd cluster health
INFO[0121] [controlplane] Building up Controller Plane..
INFO[0121] Checking if container [service-sidekick] is running on host [10.10.1.208], try #1
INFO[0121] Checking if container [service-sidekick] is running on host [10.10.2.6], try #1
INFO[0121] Checking if container [service-sidekick] is running on host [10.10.1.127], try #1
INFO[0121] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208], try #1
INFO[0121] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127], try #1
INFO[0121] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208]
INFO[0121] Removing container [service-sidekick] on host [10.10.1.208], try #1
INFO[0121] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127]
INFO[0121] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6], try #1
INFO[0121] Removing container [service-sidekick] on host [10.10.1.127], try #1
INFO[0121] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6]
INFO[0121] Removing container [service-sidekick] on host [10.10.2.6], try #1
INFO[0121] [remove/service-sidekick] Successfully removed container on host [10.10.1.127]
INFO[0121] [remove/service-sidekick] Successfully removed container on host [10.10.1.208]
INFO[0121] [remove/service-sidekick] Successfully removed container on host [10.10.2.6]
INFO[0121] Checking if container [kube-apiserver] is running on host [10.10.1.208], try #1
INFO[0121] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.208], try #1
INFO[0121] Image [rancher/hyperkube:v1.15.5-rancher1] does not exist on host [10.10.1.208]: Error: No such image: rancher/hyperkube:v1.15.5-rancher1
INFO[0121] Pulling image [rancher/hyperkube:v1.15.5-rancher1] on host [10.10.1.208], try #1
INFO[0121] Checking if container [kube-apiserver] is running on host [10.10.1.127], try #1
INFO[0121] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.127], try #1
INFO[0121] Image [rancher/hyperkube:v1.15.5-rancher1] does not exist on host [10.10.1.127]: Error: No such image: rancher/hyperkube:v1.15.5-rancher1
INFO[0121] Pulling image [rancher/hyperkube:v1.15.5-rancher1] on host [10.10.1.127], try #1
INFO[0121] Checking if container [kube-apiserver] is running on host [10.10.2.6], try #1
INFO[0121] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.2.6], try #1
INFO[0121] Image [rancher/hyperkube:v1.15.5-rancher1] does not exist on host [10.10.2.6]: Error: No such image: rancher/hyperkube:v1.15.5-rancher1
INFO[0121] Pulling image [rancher/hyperkube:v1.15.5-rancher1] on host [10.10.2.6], try #1
INFO[0146] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.208], try #1
INFO[0146] Image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.208]
INFO[0146] Checking if container [old-kube-apiserver] is running on host [10.10.1.208], try #1
INFO[0146] Stopping container [kube-apiserver] on host [10.10.1.208] with stopTimeoutDuration [5s], try #1
INFO[0146] Waiting for [kube-apiserver] container to exit on host [10.10.1.208]
INFO[0146] Renaming container [kube-apiserver] to [old-kube-apiserver] on host [10.10.1.208], try #1
INFO[0146] Starting container [kube-apiserver] on host [10.10.1.208], try #1
INFO[0146] [controlplane] Successfully updated [kube-apiserver] container on host [10.10.1.208]
INFO[0146] Removing container [old-kube-apiserver] on host [10.10.1.208], try #1
INFO[0146] [healthcheck] Start Healthcheck on service [kube-apiserver] on host [10.10.1.208]
INFO[0147] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.127], try #1
INFO[0147] Image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.127]
INFO[0147] Checking if container [old-kube-apiserver] is running on host [10.10.1.127], try #1
INFO[0147] Stopping container [kube-apiserver] on host [10.10.1.127] with stopTimeoutDuration [5s], try #1
INFO[0147] Waiting for [kube-apiserver] container to exit on host [10.10.1.127]
INFO[0147] Renaming container [kube-apiserver] to [old-kube-apiserver] on host [10.10.1.127], try #1
INFO[0147] Starting container [kube-apiserver] on host [10.10.1.127], try #1
INFO[0147] [controlplane] Successfully updated [kube-apiserver] container on host [10.10.1.127]
INFO[0147] Removing container [old-kube-apiserver] on host [10.10.1.127], try #1
INFO[0147] [healthcheck] Start Healthcheck on service [kube-apiserver] on host [10.10.1.127]
INFO[0152] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.2.6], try #1
INFO[0152] Image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.2.6]
INFO[0152] Checking if container [old-kube-apiserver] is running on host [10.10.2.6], try #1
INFO[0152] Stopping container [kube-apiserver] on host [10.10.2.6] with stopTimeoutDuration [5s], try #1
INFO[0152] Waiting for [kube-apiserver] container to exit on host [10.10.2.6]
INFO[0152] Renaming container [kube-apiserver] to [old-kube-apiserver] on host [10.10.2.6], try #1
INFO[0152] Starting container [kube-apiserver] on host [10.10.2.6], try #1
INFO[0153] [controlplane] Successfully updated [kube-apiserver] container on host [10.10.2.6]
INFO[0153] Removing container [old-kube-apiserver] on host [10.10.2.6], try #1
INFO[0153] [healthcheck] Start Healthcheck on service [kube-apiserver] on host [10.10.2.6]
INFO[0157] [healthcheck] service [kube-apiserver] on host [10.10.1.208] is healthy
INFO[0157] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208], try #1
INFO[0157] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208]
INFO[0158] Starting container [rke-log-linker] on host [10.10.1.208], try #1
INFO[0158] [controlplane] Successfully started [rke-log-linker] container on host [10.10.1.208]
INFO[0158] Removing container [rke-log-linker] on host [10.10.1.208], try #1
INFO[0158] [remove/rke-log-linker] Successfully removed container on host [10.10.1.208]
INFO[0158] Checking if container [kube-controller-manager] is running on host [10.10.1.208], try #1
INFO[0158] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.208], try #1
INFO[0158] Image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.208]
INFO[0158] Checking if container [old-kube-controller-manager] is running on host [10.10.1.208], try #1
INFO[0158] Stopping container [kube-controller-manager] on host [10.10.1.208] with stopTimeoutDuration [5s], try #1
INFO[0158] [healthcheck] service [kube-apiserver] on host [10.10.1.127] is healthy
INFO[0158] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127], try #1
INFO[0158] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127]
INFO[0158] Starting container [rke-log-linker] on host [10.10.1.127], try #1
INFO[0159] [controlplane] Successfully started [rke-log-linker] container on host [10.10.1.127]
INFO[0159] Removing container [rke-log-linker] on host [10.10.1.127], try #1
INFO[0159] [remove/rke-log-linker] Successfully removed container on host [10.10.1.127]
INFO[0159] Checking if container [kube-controller-manager] is running on host [10.10.1.127], try #1
INFO[0159] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.127], try #1
INFO[0159] Image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.127]
INFO[0159] Checking if container [old-kube-controller-manager] is running on host [10.10.1.127], try #1
INFO[0159] Stopping container [kube-controller-manager] on host [10.10.1.127] with stopTimeoutDuration [5s], try #1
INFO[0163] Waiting for [kube-controller-manager] container to exit on host [10.10.1.208]
INFO[0163] Renaming container [kube-controller-manager] to [old-kube-controller-manager] on host [10.10.1.208], try #1
INFO[0163] Starting container [kube-controller-manager] on host [10.10.1.208], try #1
INFO[0164] [controlplane] Successfully updated [kube-controller-manager] container on host [10.10.1.208]
INFO[0164] Removing container [old-kube-controller-manager] on host [10.10.1.208], try #1
INFO[0164] [healthcheck] Start Healthcheck on service [kube-controller-manager] on host [10.10.1.208]
INFO[0164] Waiting for [kube-controller-manager] container to exit on host [10.10.1.127]
INFO[0164] Renaming container [kube-controller-manager] to [old-kube-controller-manager] on host [10.10.1.127], try #1
INFO[0164] Starting container [kube-controller-manager] on host [10.10.1.127], try #1
INFO[0165] [controlplane] Successfully updated [kube-controller-manager] container on host [10.10.1.127]
INFO[0165] Removing container [old-kube-controller-manager] on host [10.10.1.127], try #1
INFO[0165] [healthcheck] Start Healthcheck on service [kube-controller-manager] on host [10.10.1.127]
INFO[0166] [healthcheck] service [kube-apiserver] on host [10.10.2.6] is healthy
INFO[0166] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6], try #1
INFO[0166] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6]
INFO[0166] Starting container [rke-log-linker] on host [10.10.2.6], try #1
INFO[0166] [controlplane] Successfully started [rke-log-linker] container on host [10.10.2.6]
INFO[0166] Removing container [rke-log-linker] on host [10.10.2.6], try #1
INFO[0166] [remove/rke-log-linker] Successfully removed container on host [10.10.2.6]
INFO[0166] Checking if container [kube-controller-manager] is running on host [10.10.2.6], try #1
INFO[0166] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.2.6], try #1
INFO[0166] Image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.2.6]
INFO[0166] Checking if container [old-kube-controller-manager] is running on host [10.10.2.6], try #1
INFO[0166] Stopping container [kube-controller-manager] on host [10.10.2.6] with stopTimeoutDuration [5s], try #1
INFO[0169] [healthcheck] service [kube-controller-manager] on host [10.10.1.208] is healthy
INFO[0169] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208], try #1
INFO[0169] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208]
INFO[0170] Starting container [rke-log-linker] on host [10.10.1.208], try #1
INFO[0170] [controlplane] Successfully started [rke-log-linker] container on host [10.10.1.208]
INFO[0170] Removing container [rke-log-linker] on host [10.10.1.208], try #1
INFO[0170] [remove/rke-log-linker] Successfully removed container on host [10.10.1.208]
INFO[0170] Checking if container [kube-scheduler] is running on host [10.10.1.208], try #1
INFO[0170] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.208], try #1
INFO[0170] Image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.208]
INFO[0170] Checking if container [old-kube-scheduler] is running on host [10.10.1.208], try #1
INFO[0170] Stopping container [kube-scheduler] on host [10.10.1.208] with stopTimeoutDuration [5s], try #1
INFO[0170] [healthcheck] service [kube-controller-manager] on host [10.10.1.127] is healthy
INFO[0170] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127], try #1
INFO[0170] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127]
INFO[0170] Starting container [rke-log-linker] on host [10.10.1.127], try #1
INFO[0171] [controlplane] Successfully started [rke-log-linker] container on host [10.10.1.127]
INFO[0171] Removing container [rke-log-linker] on host [10.10.1.127], try #1
INFO[0171] [remove/rke-log-linker] Successfully removed container on host [10.10.1.127]
INFO[0171] Checking if container [kube-scheduler] is running on host [10.10.1.127], try #1
INFO[0171] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.127], try #1
INFO[0171] Image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.1.127]
INFO[0171] Checking if container [old-kube-scheduler] is running on host [10.10.1.127], try #1
INFO[0171] Stopping container [kube-scheduler] on host [10.10.1.127] with stopTimeoutDuration [5s], try #1
INFO[0171] Waiting for [kube-controller-manager] container to exit on host [10.10.2.6]
INFO[0171] Renaming container [kube-controller-manager] to [old-kube-controller-manager] on host [10.10.2.6], try #1
INFO[0171] Starting container [kube-controller-manager] on host [10.10.2.6], try #1
INFO[0172] [controlplane] Successfully updated [kube-controller-manager] container on host [10.10.2.6]
INFO[0172] Removing container [old-kube-controller-manager] on host [10.10.2.6], try #1
INFO[0172] [healthcheck] Start Healthcheck on service [kube-controller-manager] on host [10.10.2.6]
INFO[0175] Waiting for [kube-scheduler] container to exit on host [10.10.1.208]
INFO[0175] Renaming container [kube-scheduler] to [old-kube-scheduler] on host [10.10.1.208], try #1
INFO[0175] Starting container [kube-scheduler] on host [10.10.1.208], try #1
INFO[0176] [controlplane] Successfully updated [kube-scheduler] container on host [10.10.1.208]
INFO[0176] Removing container [old-kube-scheduler] on host [10.10.1.208], try #1
INFO[0176] [healthcheck] Start Healthcheck on service [kube-scheduler] on host [10.10.1.208]
INFO[0176] Waiting for [kube-scheduler] container to exit on host [10.10.1.127]
INFO[0176] Renaming container [kube-scheduler] to [old-kube-scheduler] on host [10.10.1.127], try #1
INFO[0176] Starting container [kube-scheduler] on host [10.10.1.127], try #1
INFO[0177] [controlplane] Successfully updated [kube-scheduler] container on host [10.10.1.127]
INFO[0177] Removing container [old-kube-scheduler] on host [10.10.1.127], try #1
INFO[0177] [healthcheck] Start Healthcheck on service [kube-scheduler] on host [10.10.1.127]
INFO[0178] [healthcheck] service [kube-controller-manager] on host [10.10.2.6] is healthy
INFO[0178] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6], try #1
INFO[0178] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6]
INFO[0178] Starting container [rke-log-linker] on host [10.10.2.6], try #1
INFO[0178] [controlplane] Successfully started [rke-log-linker] container on host [10.10.2.6]
INFO[0178] Removing container [rke-log-linker] on host [10.10.2.6], try #1
INFO[0178] [remove/rke-log-linker] Successfully removed container on host [10.10.2.6]
INFO[0178] Checking if container [kube-scheduler] is running on host [10.10.2.6], try #1
INFO[0178] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.2.6], try #1
INFO[0178] Image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.2.6]
INFO[0178] Checking if container [old-kube-scheduler] is running on host [10.10.2.6], try #1
INFO[0178] Stopping container [kube-scheduler] on host [10.10.2.6] with stopTimeoutDuration [5s], try #1
INFO[0181] [healthcheck] service [kube-scheduler] on host [10.10.1.208] is healthy
INFO[0181] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208], try #1
INFO[0181] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.208]
INFO[0181] Starting container [rke-log-linker] on host [10.10.1.208], try #1
INFO[0182] [controlplane] Successfully started [rke-log-linker] container on host [10.10.1.208]
INFO[0182] Removing container [rke-log-linker] on host [10.10.1.208], try #1
INFO[0182] [remove/rke-log-linker] Successfully removed container on host [10.10.1.208]
INFO[0182] [healthcheck] service [kube-scheduler] on host [10.10.1.127] is healthy
INFO[0182] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127], try #1
INFO[0182] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.1.127]
INFO[0182] Starting container [rke-log-linker] on host [10.10.1.127], try #1
INFO[0183] [controlplane] Successfully started [rke-log-linker] container on host [10.10.1.127]
INFO[0183] Removing container [rke-log-linker] on host [10.10.1.127], try #1
INFO[0183] [remove/rke-log-linker] Successfully removed container on host [10.10.1.127]
INFO[0183] Waiting for [kube-scheduler] container to exit on host [10.10.2.6]
INFO[0183] Renaming container [kube-scheduler] to [old-kube-scheduler] on host [10.10.2.6], try #1
INFO[0183] Starting container [kube-scheduler] on host [10.10.2.6], try #1
INFO[0184] [controlplane] Successfully updated [kube-scheduler] container on host [10.10.2.6]
INFO[0184] Removing container [old-kube-scheduler] on host [10.10.2.6], try #1
INFO[0184] [healthcheck] Start Healthcheck on service [kube-scheduler] on host [10.10.2.6]
INFO[0189] [healthcheck] service [kube-scheduler] on host [10.10.2.6] is healthy
INFO[0189] Checking if image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6], try #1
INFO[0189] Image [rancher/rke-tools:v0.1.50] exists on host [10.10.2.6]
INFO[0189] Starting container [rke-log-linker] on host [10.10.2.6], try #1
INFO[0190] [controlplane] Successfully started [rke-log-linker] container on host [10.10.2.6]
INFO[0190] Removing container [rke-log-linker] on host [10.10.2.6], try #1
INFO[0190] [remove/rke-log-linker] Successfully removed container on host [10.10.2.6]
INFO[0190] [controlplane] Successfully started Controller Plane..
INFO[0190] [controlplane] Restarting the Controller Plane..
INFO[0190] Restarting container [kube-apiserver] on host [10.10.1.208], try #1
INFO[0190] Restarting container [kube-apiserver] on host [10.10.1.127], try #1
INFO[0190] Restarting container [kube-apiserver] on host [10.10.2.6], try #1
INFO[0191] [restart/kube-apiserver] Successfully restarted container on host [10.10.1.208]
INFO[0191] Restarting container [kube-controller-manager] on host [10.10.1.208], try #1
INFO[0191] [restart/kube-apiserver] Successfully restarted container on host [10.10.1.127]
INFO[0191] Restarting container [kube-controller-manager] on host [10.10.1.127], try #1
INFO[0191] [restart/kube-apiserver] Successfully restarted container on host [10.10.2.6]
INFO[0191] Restarting container [kube-controller-manager] on host [10.10.2.6], try #1
INFO[0191] [restart/kube-controller-manager] Successfully restarted container on host [10.10.1.208]
INFO[0191] Restarting container [kube-scheduler] on host [10.10.1.208], try #1
INFO[0191] [restart/kube-controller-manager] Successfully restarted container on host [10.10.2.6]
INFO[0191] Restarting container [kube-scheduler] on host [10.10.2.6], try #1
INFO[0191] [restart/kube-controller-manager] Successfully restarted container on host [10.10.1.127]
INFO[0191] Restarting container [kube-scheduler] on host [10.10.1.127], try #1
INFO[0192] [restart/kube-scheduler] Successfully restarted container on host [10.10.1.208]
INFO[0193] [restart/kube-scheduler] Successfully restarted container on host [10.10.2.6]
INFO[0193] [restart/kube-scheduler] Successfully restarted container on host [10.10.1.127]
INFO[0193] [controlplane] Successfully restarted Controller Plane..
INFO[0193] [worker] Restarting Worker Plane..          
INFO[0193] Restarting container [kubelet] on host [10.10.1.127], try #1
INFO[0193] Restarting container [kubelet] on host [10.10.1.208], try #1
INFO[0193] Restarting container [kubelet] on host [10.10.2.6], try #1
INFO[0193] [restart/kubelet] Successfully restarted container on host [10.10.1.208]
INFO[0193] Restarting container [kube-proxy] on host [10.10.1.208], try #1
INFO[0193] [restart/kubelet] Successfully restarted container on host [10.10.1.127]
INFO[0193] Restarting container [kube-proxy] on host [10.10.1.127], try #1
INFO[0194] [restart/kubelet] Successfully restarted container on host [10.10.2.6]
INFO[0194] Restarting container [kube-proxy] on host [10.10.2.6], try #1
INFO[0198] [restart/kube-proxy] Successfully restarted container on host [10.10.1.208]
INFO[0199] [restart/kube-proxy] Successfully restarted container on host [10.10.1.127]
INFO[0199] [restart/kube-proxy] Successfully restarted container on host [10.10.2.6]
INFO[0199] [worker] Successfully restarted Worker Plane.. 

A closer look at the output shows that a newer Kubernetes version (1.15.5) was installed, too:

INFO[0121] Checking if image [rancher/hyperkube:v1.15.5-rancher1] exists on host [10.10.2.6], try #1 

The rke command finished without errors and logging in to the Rancher 2 UI was possible again. The Rancher cluster is back!

Where are the new certificates?

On the Kubernetes cluster nodes, the certificates were re-created in /etc/kubernetes/ssl:

root@onl-ranx01-p:~# ll /etc/kubernetes/ssl/
total 120
-rw------- 1 root root 1675 Nov  8 16:21 kube-apiserver-key.pem
-rw------- 1 root root 1261 Nov  8 16:21 kube-apiserver.pem
-rw------- 1 root root 1679 Nov  8 16:21 kube-apiserver-proxy-client-key.pem
-rw------- 1 root root 1107 Nov  8 16:21 kube-apiserver-proxy-client.pem
-rw------- 1 root root 1679 Nov  8 16:21 kube-apiserver-requestheader-ca-key.pem
-rw------- 1 root root 1082 Nov  8 16:21 kube-apiserver-requestheader-ca.pem
-rw------- 1 root root 1675 Nov  8 16:21 kube-ca-key.pem
-rw------- 1 root root 1017 Nov  8 16:21 kube-ca.pem
-rw-r--r-- 1 root root  517 Nov  8  2018 kubecfg-kube-apiserver-proxy-client.yaml
-rw-r--r-- 1 root root  533 Nov  8  2018 kubecfg-kube-apiserver-requestheader-ca.yaml
-rw-r--r-- 1 root root  501 Nov  8  2018 kubecfg-kube-controller-manager.yaml
-rw-r--r-- 1 root root  445 Nov  8  2018 kubecfg-kube-node.yaml
-rw-r--r-- 1 root root  449 Nov  8  2018 kubecfg-kube-proxy.yaml
-rw-r--r-- 1 root root  465 Nov  8  2018 kubecfg-kube-scheduler.yaml
-rw------- 1 root root 1679 Nov  8 16:21 kube-controller-manager-key.pem
-rw------- 1 root root 1062 Nov  8 16:21 kube-controller-manager.pem
-rw------- 1 root root 1679 Nov  8 16:21 kube-etcd-10-10-1-127-key.pem
-rw------- 1 root root 1253 Nov  8 16:21 kube-etcd-10-10-1-127.pem
-rw------- 1 root root 1675 Nov  8 16:21 kube-etcd-10-10-1-208-key.pem
-rw------- 1 root root 1253 Nov  8 16:21 kube-etcd-10-10-1-208.pem
-rw------- 1 root root 1679 Nov  8 16:21 kube-etcd-10-10-2-6-key.pem
-rw------- 1 root root 1253 Nov  8 16:21 kube-etcd-10-10-2-6.pem
-rw------- 1 root root 1679 Nov  8 16:21 kube-node-key.pem
-rw------- 1 root root 1070 Nov  8 16:21 kube-node.pem
-rw------- 1 root root 1679 Nov  8 16:21 kube-proxy-key.pem
-rw------- 1 root root 1046 Nov  8 16:21 kube-proxy.pem
-rw------- 1 root root 1675 Nov  8 16:21 kube-scheduler-key.pem
-rw------- 1 root root 1050 Nov  8 16:21 kube-scheduler.pem
-rw------- 1 root root 1675 Nov  8 16:21 kube-service-account-token-key.pem
-rw------- 1 root root 1261 Nov  8 16:21 kube-service-account-token.pem

With the openssl command, the new certificates can be verified (here again using kube-apiserver-requestheader-ca.pem as an example):

root@onl-ranx01-p:~# openssl x509 -text -in /etc/kubernetes/ssl/kube-apiserver-requestheader-ca.pem | grep -A 2 Validity
        Validity
            Not Before: Nov  8 14:02:59 2018 GMT
            Not After : Nov  5 14:02:59 2028 GMT

This time the certificates are valid for the next 10 years.
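To see at a glance when each certificate on a node expires, a small shell loop over the certificate directory can print every "Not After" date. This is a sketch; the directory parameter defaults to the /etc/kubernetes/ssl path shown above and can be overridden for other setups:

```shell
#!/bin/sh
# Print the expiry ("Not After") date of every certificate in a
# directory. Defaults to /etc/kubernetes/ssl, the path RKE uses.
list_cert_expiry() {
  dir="${1:-/etc/kubernetes/ssl}"
  for cert in "$dir"/*.pem; do
    case "$cert" in *-key.pem) continue ;; esac   # private keys are not certificates
    [ -f "$cert" ] || continue                    # guard against an unmatched glob
    printf '%-55s %s\n' "$cert" \
      "$(openssl x509 -noout -enddate -in "$cert" | cut -d= -f2)"
  done
}

list_cert_expiry   # run against /etc/kubernetes/ssl on a cluster node
```

Running this regularly (or after a rotation, as a sanity check) avoids having to inspect each certificate one by one with openssl x509 -text.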

What happens to the containers?

Probably the first thought in such a situation is: "Is my application still running?". The good news: yes, it is. The containers (pods) themselves continue to run even though the Kubernetes cluster is no longer operational. As long as the pods are not touched in any way (e.g. a forced restart from the command line on the host), they keep working. Of course this also means that no deployments are possible while the Kubernetes cluster is broken.

But what about Kubernetes clusters created by Rancher?

The rke cert rotate method only applies to clusters created with the rke command - mainly the Kubernetes cluster hosting the Rancher 2 management layer. But certificates can, of course, also expire on Kubernetes clusters created by Rancher 2 (in the UI).

Starting with Rancher 2.2.x, each Rancher-created Kubernetes cluster has an option to rotate its certificates. It can be found in the "Global" view listing all clusters: a click on the "three dots" icon reveals the "Rotate Certificates" option:

Rotate Kubernetes certificates in Rancher 2 UI
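For automation, the same action can reportedly also be triggered through the Rancher v3 API. The following is a hedged sketch: the cluster ID and API token are placeholders, and the ?action=rotateCertificates endpoint is an assumption to verify in your Rancher version's API browser; the hostname is the one from the helm command below.

```shell
#!/bin/sh
# Hedged sketch: trigger certificate rotation via the Rancher v3 API.
# CLUSTER_ID and API_TOKEN are hypothetical placeholders; verify the
# rotateCertificates action against your Rancher API browser before use.
RANCHER_URL="https://rancher2.example.com"
CLUSTER_ID="c-abcde"                 # hypothetical cluster ID
API_TOKEN="token-xxxxx:secret"      # hypothetical API token
ROTATE_URL="$RANCHER_URL/v3/clusters/$CLUSTER_ID?action=rotateCertificates"

# The actual call needs a reachable Rancher instance:
# curl -sk -u "$API_TOKEN" -X POST "$ROTATE_URL"
echo "$ROTATE_URL"
```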

But if Rancher 2 still runs on an older version (2.0.x or 2.1.x), the only way forward is to upgrade Rancher to a version which includes this "Rotate Certificates" option. For the 2.0.x release this is 2.0.15, for the 2.1.x release it is 2.1.10. The recommendation, however, is to upgrade to Rancher 2.2.4 or newer.

A specific release can be installed using helm:

helm upgrade --version 2.2.8 rancher rancher-stable/rancher --set hostname=rancher2.example.com --set ingress.tls.source=secret

For more information on how to perform a Rancher 2 upgrade, see Upgrade a Rancher 2 HA management cluster with helm.

How to monitor the Kubernetes certificates?

Inside Rancher 2 there is no information about soon-to-expire certificates, so how does one monitor them?

As the Kubernetes API is served on every controlplane node, the certificate can be checked with a simple HTTPS connection to port 6443:

root@linux:~# curl https://10.10.45.11:6443 -v -k
* Rebuilt URL to: https://10.10.45.11:6443/
*   Trying 10.10.45.11...
* Connected to 10.10.45.11 (10.10.45.11) port 6443 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*      server certificate verification SKIPPED
*      server certificate status verification SKIPPED
*      common name: node1@1541691349 (does not match '10.10.45.11')
*      server certificate expiration date FAILED
*      server certificate activation date OK
*      certificate public key: RSA
*      certificate version: #3
*      subject: CN=node1@1541691349
*      start date: Thu, 08 Nov 2018 15:35:49 GMT
*      expire date: Fri, 08 Nov 2019 15:35:49 GMT
*      issuer: CN=node1-ca@1541691348
*      compression: NULL
* ALPN, server accepted to use http/1.1
> GET / HTTP/1.1
> Host: 10.10.45.11:6443
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
< Content-Type: application/json
< Date: Mon, 11 Nov 2019 09:43:32 GMT
< Content-Length: 165
<
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
* Connection #0 to host 10.10.45.11 left intact
}

With this information, the check_http monitoring plugin can be used to check certificate validity (here with warning and critical thresholds of 14 and 7 days before expiry):

root@linux:~# /usr/lib/nagios/plugins/check_http -I 10.10.45.11 -p 6443 -S -C 14,7
CRITICAL - Certificate 'kube-apiserver' expired on Fri 08 Nov 2019 03:35:00 PM CET.
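If check_http is not available, the same check can be scripted directly with openssl. This is a sketch under a few assumptions: GNU date is available, the host and port are the example node from the curl output above, and the critical threshold is hard-coded to 14 days:

```shell
#!/bin/sh
# Days remaining until a certificate's "Not After" date (GNU date assumed).
days_until() {
  end_epoch=$(date -d "$1" +%s)
  echo $(( (end_epoch - $(date +%s)) / 86400 ))
}

# Fetch the serving certificate from a node's API port and evaluate it.
check_node_cert() {
  host="$1"; port="${2:-6443}"
  enddate=$(echo | openssl s_client -connect "$host:$port" 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
  if [ -z "$enddate" ]; then
    echo "UNKNOWN - could not fetch certificate from $host:$port"
    return 3
  fi
  days=$(days_until "$enddate")
  if [ "$days" -lt 14 ]; then
    echo "CRITICAL - certificate on $host:$port expires in $days days"
    return 2
  fi
  echo "OK - certificate on $host:$port valid for another $days days"
}

# Example (needs a reachable cluster node):
# check_node_cert 10.10.45.11 6443
```

The return codes (0/2/3) follow the Nagios plugin convention, so the function can be dropped into an existing monitoring setup next to check_http.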