A Rancher 2 upgrade gone bad: Management cluster down (a.k.a. you really need to ditch helm v2 and use helm v3)

Published on February 9th 2021

While testing multiple upgrade scenarios on a Rancher Kubernetes management cluster (also known as the "local" cluster), one of the methods we tested was to use the now outdated helm v2 as the Rancher deployment tool.

The official Rancher documentation still keeps an upgrade guide for helm v2, although it mentions that helm v3 should be used:

Helm 3 has been released. If you are using Helm 2, we recommend migrating to Helm 3 because it is simpler to use and more secure than Helm 2.
This section provides a copy of the older instructions for upgrading Rancher with Helm 2, and it is intended to be used if upgrading to Helm 3 is not feasible.

There is no word in the documentation that Helm 2 could do any harm, so we tried it. And broke the cluster.

The situation

A Rancher management cluster (named "local" in the Rancher user interface) running Rancher 2.2.8 and Kubernetes 1.15.12 needed to be upgraded to Rancher 2.4.x. According to the Rancher support and maintenance matrix, Rancher 2.4.x requires a minimum Kubernetes version of 1.15.12, which matches the Kubernetes version of this cluster.
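Such a minimum version check can also be scripted. The following is a minimal sketch, assuming GNU sort (for its -V version sort) is available; the function name version_ge is made up for this example, and the version numbers are the ones from this article.

```shell
# version_ge HAVE NEED: succeeds (exit 0) when version HAVE is greater than
# or equal to version NEED. Relies on the version sort of `sort -V`.
version_ge() {
  # The smaller of the two versions sorts first. If that is the required
  # minimum, the version we have is at least as new.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage with the versions from this article:
# version_ge "1.15.12" "1.15.12" && echo "Kubernetes version is sufficient"
```

A plain string comparison would get this wrong, as "1.15.9" sorts after "1.15.12" alphabetically; the version sort handles this correctly.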

We have done a couple of Rancher upgrades already in the past and we even wrote an article (Upgrade a Rancher 2 HA management cluster with helm) about it. We basically followed our own documentation with the additional information from the Rancher 2 upgrade guide for helm 2.

Problems with helm

Obviously the first step is always to create a backup. Whether or not the documentation mentions it, this is definitely a must. As this was a test environment, we simply took a filesystem snapshot of all nodes. In case of problems we would start the nodes from the snapshot again.
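For clusters built with rke (as this one was), an etcd snapshot is another backup option next to filesystem snapshots. We did not use it in this test, so take the following as a sketch only: the function name etcd_snapshot and the snapshot name are made up, and the rke binary is assumed to be in the current directory.

```shell
# Path to the rke binary; "./rke" is an assumption, adjust as needed.
RKE="${RKE:-./rke}"

# etcd_snapshot CONFIG NAME: save a named etcd snapshot of the cluster
# described by the rke cluster configuration file CONFIG.
etcd_snapshot() {
  "$RKE" etcd snapshot-save --config "$1" --name "$2"
}

# Usage (the snapshot name is hypothetical):
# etcd_snapshot RANCHER2_TEST/3-node-rancher-test.yml pre-upgrade-2.2.8
```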

The next step in the official documentation is to update the helm repository but here we ran into the first issue:

$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Skip local chart repository
...Successfully got an update from the "rancher-stable" chart repository
...Unable to get an update from the "stable" chart repository (https://kubernetes-charts.storage.googleapis.com):
Failed to fetch https://kubernetes-charts.storage.googleapis.com/index.yaml : 403 Forbidden
Update Complete.

According to the documentation, helm repo list should reveal the following repositories in the output:

[Screenshot: helm repo list output according to Rancher documentation]

This was confirmed:

$ helm repo list
NAME            URL
stable          https://kubernetes-charts.storage.googleapis.com
local           http://127.0.0.1:8879/charts
rancher-stable https://releases.rancher.com/server-charts/stable

However, this still doesn't explain the 403 error from above.

Note: This will be explained later in this article. Keep on reading ;-)

Anyway, we decided to proceed with the procedure and tried to retrieve the current values from the rancher deployment. But here we ran into the next problem:

$ helm get values rancher
Error: release: "rancher" not found

For an unknown reason, the current Rancher 2 deployment was not deployed under the name "rancher" but with a different name. This can be verified using helm list:

$ helm list
NAME                  REVISION  UPDATED                   STATUS    CHART          APP VERSION  NAMESPACE
alliterating-echidna  1         Fri Dec 13 09:52:12 2019  DEPLOYED  rancher-2.2.8  v2.2.8       cattle-system

Note: By default Rancher should be deployed as "rancher".
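Instead of assuming the release name, it can be looked up from the chart column. A minimal sketch, assuming the helm list output format shown above; the function name find_rancher_release is made up for this example.

```shell
# find_rancher_release: read `helm list` output on stdin and print the name
# (first column) of every release whose chart looks like "rancher-<version>".
find_rancher_release() {
  awk 'NR > 1 {
         # Skip the header line, then scan all columns after the name.
         for (i = 2; i <= NF; i++)
           if ($i ~ /^rancher-[0-9]/) { print $1; break }
       }'
}

# Usage:
# helm list | find_rancher_release
```

As the function scans every column for the chart name, it does not depend on the column order, which differs between helm v2 and helm v3.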

The values can now be retrieved using this helm deployment name (alliterating-echidna):

$ helm get values alliterating-echidna
hostname: rancher2-test.example.com
ingress:
  tls:
    source: secret

Rancher 2 re-installation fails (oops-a-daisy)

When upgrading Rancher, there are two possibilities:

A) Upgrade Rancher

B) Delete and re-install Rancher

As we wanted to get rid of the weird name and install Rancher under the correct "rancher" name, we decided to go with option B.

$ helm delete alliterating-echidna
release "alliterating-echidna" deleted

When we wanted to verify that the helm deployment was correctly deleted, we ran into another error:

$ helm list
Error: {"Code":{"Code":"Forbidden","Status":403},"Message":"users \"u-3uhgbtaxx2\" is forbidden: User \"system:serviceaccount:cattle-system:alliterating-echidna-rancher\" cannot impersonate resource \"users\" in API group \"\" at the cluster scope","Cause":null,"FieldName":""} (get pods)

We expected just an empty list as output, not a 403 error. Anyway, we continued with the Rancher 2 installation:

$ helm install rancher-stable/rancher --version 2.4.13 --namespace cattle-system --set hostname=rancher2-test.example.com --set ingress.tls.source=secret
Error: the server could not find the requested resource (get pods)

Wow. At this point the Kubernetes cluster stopped returning needed information.

A look at the Rancher UI in the browser revealed nothing but a "404 - default backend" page. That's the Kubernetes Ingress answering the request itself, instead of forwarding it to a Rancher container.

We tried to fix the Kubernetes cluster with another run of rke, but even this resulted in an error:

$ ./rke up --config RANCHER2_TEST/3-node-rancher-test.yml
INFO[0000] Running RKE version: v1.1.2
INFO[0000] Initiating Kubernetes cluster
[...]
INFO[0239] [worker] Successfully started Worker Plane..
INFO[0239] [controlplane] Now checking status of node 10.10.44.12, try #1
ERRO[0264] Host 10.10.44.12 failed to report Ready status with error: [controlplane] Error getting node 10.10.44.12: "10.10.44.12" not found
INFO[0264] [controlplane] Now checking status of node 10.10.44.13, try #1
ERRO[0289] Host 10.10.44.13 failed to report Ready status with error: [controlplane] Error getting node 10.10.44.13: "10.10.44.13" not found
INFO[0289] [controlplane] Now checking status of node 10.10.44.14, try #1
ERRO[0314] Host 10.10.44.14 failed to report Ready status with error: [controlplane] Error getting node 10.10.44.14: "10.10.44.14" not found
INFO[0314] [controlplane] Processing controlplane hosts for upgrade 1 at a time
INFO[0314] Processing controlplane host 10.10.44.12
INFO[0314] [controlplane] Now checking status of node 10.10.44.12, try #1
ERRO[0339] Failed to upgrade hosts: 10.10.44.12 with error [[controlplane] Error getting node 10.10.44.12: "10.10.44.12" not found]
FATA[0339] [controlPlane] Failed to upgrade Control Plane: [[[controlplane] Error getting node 10.10.44.12: "10.10.44.12" not found]]

At this time we officially declared this Rancher 2 management cluster as kaputt.

We restored and restarted the cluster nodes from the snapshots and went back to Rancher 2.2.8.

Helm v2 to v3 migration

We then started the upgrade from scratch again and this time we chose to migrate to helm v3 first.

helm v3 was downloaded, unpacked and placed at /usr/local/bin/helm3:

$ wget https://get.helm.sh/helm-v3.5.2-linux-amd64.tar.gz
$ tar -xzf helm-v3.5.2-linux-amd64.tar.gz
$ cd linux-amd64/
$ chmod 755 helm
$ mv helm helm3
$ sudo cp -p helm3 /usr/local/bin/
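We skipped it above, but a downloaded archive can be verified before it is unpacked. A minimal sketch, assuming sha256sum is installed and that a trusted SHA-256 digest for the archive is at hand (for example from the release page); the function name verify_sha256 is made up for this example.

```shell
# verify_sha256 FILE EXPECTED: succeeds only if the SHA-256 digest of FILE
# equals the EXPECTED hex string.
verify_sha256() {
  # sha256sum prints "<digest>  <file>"; keep only the digest.
  actual="$(sha256sum "$1" | awk '{ print $1 }')"
  [ "$actual" = "$2" ]
}

# Usage (EXPECTED_DIGEST is a placeholder for the published digest):
# verify_sha256 helm-v3.5.2-linux-amd64.tar.gz "$EXPECTED_DIGEST" || echo "checksum mismatch"
```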

The helm3 repo list is obviously empty at this point. The helm v2 configuration first needs to be migrated (moved) to v3. For this purpose there is a helm plugin (2to3), which can be installed and used as follows:

$ helm3 plugin install https://github.com/helm/helm-2to3
Downloading and installing helm-2to3 v0.8.1 ...
https://github.com/helm/helm-2to3/releases/download/v0.8.1/helm-2to3_0.8.1_linux_amd64.tar.gz
Installed plugin: 2to3

$ helm3 plugin list
NAME  VERSION  DESCRIPTION
2to3  0.8.1    migrate and cleanup Helm v2 configuration and releases in-place to Helm v3

The v2 configs can now be migrated (moved) to v3:

$ helm3 2to3 move config
2021/02/08 13:38:54 WARNING: Helm v3 configuration may be overwritten during this operation.
2021/02/08 13:38:54
[Move config/confirm] Are you sure you want to move the v2 configuration? [y/N]: y
2021/02/08 13:39:06
Helm v2 configuration will be moved to Helm v3 configuration.
2021/02/08 13:39:06 [Helm 2] Home directory: /home/admin/.helm
2021/02/08 13:39:06 [Helm 3] Config directory: /home/admin/.config/helm
2021/02/08 13:39:06 [Helm 3] Data directory: /home/admin/.local/share/helm
2021/02/08 13:39:06 [Helm 3] Cache directory: /home/admin/.cache/helm
2021/02/08 13:39:06 [Helm 3] Create config folder "/home/admin/.config/helm" .
2021/02/08 13:39:06 [Helm 2] repositories file "/home/admin/.helm/repository/repositories.yaml" will copy to [Helm 3] config folder "/home/admin/.config/helm/repositories.yaml" .
2021/02/08 13:39:06 [Helm 3] Create cache folder "/home/admin/.cache/helm" .
2021/02/08 13:39:06 [Helm 3] Create data folder "/home/admin/.local/share/helm" .
2021/02/08 13:39:06 [Helm 2] starters "/home/admin/.helm/starters" will copy to [Helm 3] data folder "/home/admin/.local/share/helm/starters" .

And the repository list should be the same as on helm v2:

$ helm3 repo list
WARNING: "kubernetes-charts.storage.googleapis.com" is deprecated for "stable" and will be deleted Nov. 13, 2020.
WARNING: You should switch to "https://charts.helm.sh/stable" via:
WARNING: helm repo add "stable" "https://charts.helm.sh/stable" --force-update
NAME            URL
stable          https://kubernetes-charts.storage.googleapis.com
local           http://127.0.0.1:8879/charts
rancher-stable https://releases.rancher.com/server-charts/stable

Oh, wow! This is actually the first helpful hint concerning the kubernetes-charts repository! No wonder helm v2 ran into a 403 error when updating the repositories: the old repository was removed at the end of 2020. Let's replace it with the new one, as suggested by helm3:

$ helm3 repo add "stable" "https://charts.helm.sh/stable" --force-update
WARNING: "kubernetes-charts.storage.googleapis.com" is deprecated for "stable" and will be deleted Nov. 13, 2020.
WARNING: You should switch to "https://charts.helm.sh/stable" via:
WARNING: helm repo add "stable" "https://charts.helm.sh/stable" --force-update
"stable" has been added to your repositories

$ helm3 repo list
NAME            URL
stable          https://charts.helm.sh/stable
local           http://127.0.0.1:8879/charts
rancher-stable https://releases.rancher.com/server-charts/stable
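To spot such leftovers on other machines, the repository list can be scanned for the retired URL. A minimal sketch, assuming the helm repo list output format shown above; the function name deprecated_repos is made up for this example.

```shell
# deprecated_repos: read `helm repo list` output on stdin and print the names
# of repositories that still point at the retired kubernetes-charts bucket.
deprecated_repos() {
  # Ignore the header and any WARNING lines, then match on the URL column.
  awk '$1 != "NAME" && $1 !~ /^WARNING/ &&
       $2 ~ /kubernetes-charts\.storage\.googleapis\.com/ { print $1 }'
}

# Usage:
# helm3 repo list | deprecated_repos
```

With the repository list from before the fix, this prints "stable"; with the list after the fix, it prints nothing.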

What about a repo update with this new repository?

$ helm3 repo update
Hang tight while we grab the latest from your chart repositories...
...Unable to get an update from the "local" chart repository (http://127.0.0.1:8879/charts):
Get "http://127.0.0.1:8879/charts/index.yaml": dial tcp 127.0.0.1:8879: connect: connection refused
...Successfully got an update from the "rancher-stable" chart repository
...Successfully got an update from the "stable" chart repository
Update Complete. Happy Helming!

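Finally, no more errors from the "stable" repository! The only remaining complaint concerns the "local" repository, which pointed to the local chart server of helm v2 (helm serve). Helm v3 no longer ships this command, so nothing listens on 127.0.0.1:8879; the entry can be removed with helm3 repo remove local.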

Now the helm "release" needs to be migrated, too. From helm list we know the name of the release (alliterating-echidna) and the namespace (cattle-system) this release was deployed to:

$ helm list
NAME                  REVISION  UPDATED                   STATUS    CHART          APP VERSION  NAMESPACE
alliterating-echidna  1         Fri Dec 13 09:52:12 2019  DEPLOYED  rancher-2.2.8  v2.2.8       cattle-system

To migrate this release to helm3:

$ helm3 2to3 convert alliterating-echidna
2021/02/16 12:03:23 Release "alliterating-echidna" will be converted from Helm v2 to Helm v3.
2021/02/16 12:03:23 [Helm 3] Release "alliterating-echidna" will be created.
[...]
2021/02/16 12:03:23 [Helm 3] Release "alliterating-echidna" created.
2021/02/16 12:03:23 Release "alliterating-echidna" was converted successfully from Helm v2 to Helm v3.
2021/02/16 12:03:23 Note: The v2 release information still remains and should be removed to avoid conflicts with the migrated v3 release.
2021/02/16 12:03:23 v2 release information should only be removed using `helm 2to3` cleanup and when all releases have been migrated over.
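The 2to3 plugin also offers a --dry-run flag, which simulates the conversion without changing anything. A minimal sketch of a "simulate first, then convert" wrapper, assuming the helm v3 binary is called helm3 as in this article; the function name convert_release is made up for this example.

```shell
# Helm v3 binary to call; "helm3" matches the name used in this article.
HELM3="${HELM3:-helm3}"

# convert_release RELEASE: simulate the conversion first and only run the
# real conversion when the simulation succeeded.
convert_release() {
  "$HELM3" 2to3 convert --dry-run "$1" || return 1
  "$HELM3" 2to3 convert "$1"
}

# Usage:
# convert_release alliterating-echidna
```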

Now helm3 list can be used (in combination with the namespace) to find the helm deployed Rancher release:

$ helm3 list -n cattle-system
NAME                  NAMESPACE      REVISION  UPDATED                                  STATUS    CHART          APP VERSION
alliterating-echidna  cattle-system  1         2019-12-13 09:52:12.505181994 +0000 UTC  deployed  rancher-2.2.8  v2.2.8

Rancher 2 upgrade using helm v3

The syntax of helm v3 is slightly different. The most important change to keep in mind is that releases are now scoped to a namespace, so the -n parameter (to define the namespace) needs to be used. With it, the configuration values of the Rancher helm deployment (still using the weird name from above) can be retrieved:

$ helm3 get values alliterating-echidna -n cattle-system
USER-SUPPLIED VALUES:
hostname: rancher2-test.example.com
ingress:
  tls:
    source: secret

And the deployment can be upgraded using these values:

$ helm3 upgrade alliterating-echidna rancher-stable/rancher --version 2.4.13 --namespace cattle-system --set hostname=rancher2-test.example.com --set ingress.tls.source=secret
Release "alliterating-echidna" has been upgraded. Happy Helming!
NAME: alliterating-echidna
LAST DEPLOYED: Tue Feb 9 09:49:15 2021
NAMESPACE: cattle-system
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
Rancher Server has been installed.

NOTE: Rancher may take several minutes to fully initialize. Please standby while Certificates are being issued and Ingress comes up.

Check out our docs at https://rancher.com/docs/rancher/v2.x/en/

Browse to https://rancher2-test.example.com

Happy Containering!

This time no error showed up! That looks much better, at least in the terminal. What about the Rancher UI?

After opening the Rancher UI in the browser, it took around one minute until the Rancher version in the bottom left changed to 2.4.13.

All the Rancher managed Kubernetes clusters were marked as unavailable for a few minutes as they needed to restart their services to speak with the new Rancher version. But after a couple of minutes, all the clusters became available and active again.
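The upgrade above repeated the values by hand with --set. With more than a handful of values, it may be easier to save them to a file and pass that file to the upgrade. A minimal sketch, assuming the helm v3 binary is called helm3 as in this article; the function name upgrade_with_saved_values and the file name in the usage line are made up for this example.

```shell
# Helm v3 binary to call; "helm3" matches the name used in this article.
HELM3="${HELM3:-helm3}"

# upgrade_with_saved_values RELEASE NAMESPACE CHART VERSION VALUES_FILE
upgrade_with_saved_values() {
  # Save the user-supplied values of the running release as plain YAML.
  "$HELM3" get values "$1" -n "$2" -o yaml > "$5" || return 1
  # Upgrade the same release in place, feeding the saved values back in.
  "$HELM3" upgrade "$1" "$3" --version "$4" -n "$2" -f "$5"
}

# Usage with the names from this article:
# upgrade_with_saved_values alliterating-echidna cattle-system rancher-stable/rancher 2.4.13 rancher-values.yaml
```

The saved file doubles as a record of the configuration the release had before the upgrade.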

TL;DR: Use helm v3

Although Rancher still keeps helm v2 in the documentation (with outdated information such as the old kubernetes-charts repository), helm v3 should definitely be used. The official helm v2 to v3 migration guide by Helm looks more confusing than it actually is.

Infiniroot Blog: We sometimes write, too.