How to fix a Rancher 2 cluster node stuck registering in downstream cluster (not shown in UI)

Published on January 10th 2022

Working monitoring that you can rely on is key for a production environment. But monitoring should not just be there to return green or red, OK or CRITICAL, 0 or 1. Well-implemented monitoring also reduces troubleshooting time by pointing in the direction where a problem occurs.

And sometimes good monitoring can also detect broken things that are not shown by the application itself.

Cluster node stuck in "is registering" phase

This happened a few weeks ago with a Kubernetes cluster managed by Rancher 2. As we are using the open-source monitoring plugin check_rancher2 for Rancher-managed Kubernetes clusters, our monitoring started to alert about a node being stuck in the cluster registering phase:

Rancher 2 node is stuck in cluster registering phase

On the command line, the output looks like this:

$ ./check_rancher2.sh -H rancher.example.com -U token-xxxxx -P "secret" -S -t node
CHECK_RANCHER2 CRITICAL - null in cluster c-zs42v is registering -|'nodes_total'=67;;;; 'node_errors'=1;;;; 'node_ignored'=0;;;;

This alert raised a few eyebrows. Why is the node's name set to "null" instead of a real host name? Why is this particular node stuck in the "registering" phase? And why does this not show up in the Rancher 2 user interface?
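The alert line carries two useful pieces of information: the cluster ID in the status text and the counters after the pipe. As a rough sketch, this is how they could be pulled apart in Python. The "in cluster <id>" wording is taken from this one sample line only; other check_rancher2 messages may be phrased differently, so treat the pattern as an assumption.

```python
import re

# Sample plugin output copied from the check above.
output = ("CHECK_RANCHER2 CRITICAL - null in cluster c-zs42v is registering -|"
          "'nodes_total'=67;;;; 'node_errors'=1;;;; 'node_ignored'=0;;;;")


def parse_check_output(line):
    """Split a plugin output line into status text, counters and cluster ID."""
    text, _, perfdata = line.partition("|")
    # Each performance data item looks like 'label'=value;warn;crit;min;max
    metrics = {label: int(value)
               for label, value in re.findall(r"'([^']+)'=(\d+)", perfdata)}
    # Assumption: the cluster ID follows the word "cluster" in the status text.
    match = re.search(r"cluster (\S+)", text)
    cluster_id = match.group(1) if match else None
    return text.strip(), metrics, cluster_id


text, metrics, cluster_id = parse_check_output(output)
print(cluster_id, metrics)
```

Only the cluster ID and the integer counters are extracted here; thresholds and units, which are empty in this output anyway, are ignored.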

At least the cluster name is shown by check_rancher2, so we have an additional hint to follow. By using the -t info check type, all Kubernetes clusters (managed by this Rancher 2 setup) can be listed:

$ ./check_rancher2.sh -H rancher.example.com -U token-xxxxx -P "secret" -S -t info
CHECK_RANCHER2 OK - Found 9 clusters: c-6p529 alias me-prod - c-dzfvn alias prod-ext - c-gsczw alias aws-prod - c-hmgcp alias prod-int - c-pls9j alias vamp - c-s2c8b alias gamma - c-xjvzp alias et-prod - c-zhsdr alias azure-prod - local alias local - and 25 projects: [...] |'clusters'=9;;;; 'projects'=25;;;;

The important part: There is no such cluster with the ID c-zs42v!

Searching for a missing cluster

By running kubectl against the Rancher 2 (local) cluster, additional information from the Kubernetes API can be retrieved. In this particular situation we focus on the namespaces, as each cluster created by Rancher 2 (RKE) also gets its own namespace in the local cluster:

$ kubectl get ns
NAME                                         STATUS   AGE
c-6p529                                      Active   482d
c-dzfvn                                      Active   138d
c-gsczw                                      Active   524d
c-hmgcp                                      Active   54d
c-jfxkq                                      Active   54d
c-pls9j                                      Active   606d
c-s2c8b                                      Active   3y15d
c-xjvzp                                      Active   628d
c-zhsdr                                      Active   523d
c-zs42v                                      Active   55d
cattle-global-data                           Active   2y12d
cattle-global-nt                             Active   273d
cattle-system                                Active   3y15d
cluster-fleet-default-c-6p529-0a63de8fc176   Active   17m
cluster-fleet-default-c-dzfvn-db0ece01cc3b   Active   17m
cluster-fleet-default-c-gsczw-b67c2a857200   Active   17m
cluster-fleet-default-c-hmgcp-684fbe9142cb   Active   17m
cluster-fleet-default-c-pls9j-b8ab525e0c29   Active   17m
cluster-fleet-default-c-s2c8b-4e26ad7ae3c1   Active   17m
cluster-fleet-default-c-xjvzp-0b65f14fef6c   Active   17m
cluster-fleet-default-c-zhsdr-955f3b1ac907   Active   17m
cluster-fleet-local-local-1a3d67d0a899       Active   17m
default                                      Active   3y15d
fleet-clusters-system                        Active   18m
fleet-default                                Active   17m
fleet-local                                  Active   17m
[...]

All the known cluster IDs (seen before with the -t info check of check_rancher2) show up. But the list also contains c-zs42v, the cluster ID from the alert. Our missing cluster!
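Comparing the two lists by eye works for a handful of clusters, but it can also be scripted. The following is a minimal sketch that uses the cluster IDs and a shortened copy of the namespace names from the outputs above. The pattern for a cluster ID ("c-" followed by five lowercase letters or digits) is an assumption derived from the IDs seen in this setup, and the function name is made up for this example.

```python
import re

# Cluster IDs reported by the "-t info" check above.
known_clusters = {"c-6p529", "c-dzfvn", "c-gsczw", "c-hmgcp", "c-pls9j",
                  "c-s2c8b", "c-xjvzp", "c-zhsdr", "local"}

# Namespace names from the "kubectl get ns" output above (shortened).
namespaces = ["c-6p529", "c-dzfvn", "c-gsczw", "c-hmgcp", "c-jfxkq",
              "c-pls9j", "c-s2c8b", "c-xjvzp", "c-zhsdr", "c-zs42v",
              "cattle-global-data", "cattle-system", "default",
              "fleet-default"]

# Assumption: cluster namespaces look like "c-" plus five lowercase
# letters or digits, as in the IDs listed above.
CLUSTER_ID = re.compile(r"c-[a-z0-9]{5}")


def unlisted_cluster_namespaces(namespaces, known):
    """Return cluster-like namespaces that the cluster list does not contain."""
    return sorted(ns for ns in namespaces
                  if CLUSTER_ID.fullmatch(ns) and ns not in known)


print(unlisted_cluster_namespaces(namespaces, known_clusters))
```

With the data shown in this article, the comparison returns c-zs42v and also c-jfxkq, a second namespace that does not appear in the cluster list.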

As we know from check_rancher2, there is a node stuck trying to register in this cluster. By looking at the cluster registration tokens of this namespace, we can find out which user launched this operation:

$ kubectl get clusterregistrationtokens.management.cattle.io --namespace c-zs42v -o json
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "management.cattle.io/v3",
            "kind": "ClusterRegistrationToken",
            "metadata": {
                "annotations": {
                    "field.cattle.io/creatorId": "u-buuctqjhrm"
                },
                "creationTimestamp": "2021-09-28T12:08:57Z",
                "generateName": "crt-",
                "generation": 1,
                "labels": {
                    "cattle.io/creator": "norman"
                },
                "managedFields": [
[...]

With this information we now have an exact timestamp (creationTimestamp) and the user ID (creatorId).
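When a namespace contains several tokens, picking these two fields out of the full JSON gets tedious. Here is a small sketch that reads them from the kubectl JSON output. The embedded document is a trimmed copy of the output above with only the fields needed here, and the function name is just an illustration.

```python
import json

# Trimmed copy of the kubectl JSON output above; only the fields used here.
raw = """
{
  "items": [
    {
      "kind": "ClusterRegistrationToken",
      "metadata": {
        "annotations": {"field.cattle.io/creatorId": "u-buuctqjhrm"},
        "creationTimestamp": "2021-09-28T12:08:57Z"
      }
    }
  ]
}
"""


def token_creators(doc):
    """Return (creator ID, creation timestamp) for every token in the list."""
    result = []
    for item in doc.get("items", []):
        meta = item.get("metadata", {})
        creator = meta.get("annotations", {}).get("field.cattle.io/creatorId")
        result.append((creator, meta.get("creationTimestamp")))
    return result


print(token_creators(json.loads(raw)))
```

In practice the input would come from the kubectl command shown above, for example piped into the script on standard input.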

Once the user ID had been matched to another cluster administrator, we asked this user what happened on that day. It turned out that this person tried to create a new cluster in Rancher 2 but forgot to create the required security groups (firewall rules). This led to a cluster in a failed state, unable to actually deploy Kubernetes. The user then deleted the cluster in the Rancher 2 user interface. As the cluster disappeared, the user thought all was good and went on to create another cluster (this time successfully).

But - as our monitoring shows - something was still happening in the background. We know the reason why - but we still need to clean this up.

Node lookup and deletion via Rancher 2 API

The check_rancher2 monitoring plugin reads the node information from the Rancher 2 API (accessible under the /v3 path). Even though the node's name is shown as "null", we can still query the API and use jq to filter the JSON output for a specific cluster:

$ curl -s -u token-xxxxx:secret https://rancher.example.com/v3/nodes | jq -r '.data[] | select(.clusterId == "c-zs42v")'
{
"appliedNodeVersion": 0,
"baseType": "node",
"clusterId": "c-zs42v",
"conditions": [
    {
      "status": "True",
      "type": "Initialized"
    },
    {
      "message": "waiting to register with Kubernetes",
      "status": "Unknown",
      "type": "Registered"
    },
    {
      "status": "True",
      "type": "Provisioned"
    }
],
"controlPlane": true,
"created": "2021-09-28T12:46:57Z",
"createdTS": 1632833217000,
"creatorId": null,
"customConfig": {
    "address": "10.10.204.124",
    "type": "/v3/schemas/customConfig"
},
"dockerInfo": {
    "debug": false,
    "experimentalBuild": false,
    "type": "/v3/schemas/dockerInfo"
},
"etcd": true,
"id": "c-zs42v:m-a4c0d00d69b6",
"imported": true,
"info": {
    "cpu": {
      "count": 0
    },
    "kubernetes": {
      "kubeProxyVersion": "",
      "kubeletVersion": ""
    },
    "memory": {
      "memTotalKiB": 0
    },
    "os": {
      "dockerVersion": "",
      "kernelVersion": "",
      "operatingSystem": ""
    }
},
"ipAddress": "10.10.204.124",
"links": {
    "remove": "https://rancher.example.com/v3/nodes/c-zs42v:m-a4c0d00d69b6",
    "self": "https://rancher.example.com/v3/nodes/c-zs42v:m-a4c0d00d69b6",
    "update": "https://rancher.example.com/v3/nodes/c-zs42v:m-a4c0d00d69b6"
},
"name": "",
"namespaceId": null,
"nodePoolId": "",
"nodeTemplateId": null,
"requestedHostname": "xyz-node1",
"sshUser": "root",
"state": "registering",
"transitioning": "yes",
"transitioningMessage": "waiting to register with Kubernetes",
"type": "node",
"unschedulable": false,
"uuid": "56e0a435-8ad5-48c3-9560-69d0592a9afa",
"worker": false
}

Thanks to this detailed output, we now also know the original IP address ("ipAddress": "10.10.204.124") and the host name ("requestedHostname": "xyz-node1"). We can also see the same information ("state": "registering") as reported by the monitoring. And even though check_rancher2 showed "null" as the node name (retrieved from the empty "name" field), there is a unique ID of this node: "id": "c-zs42v:m-a4c0d00d69b6".
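The same lookup can be done without jq. Below is a rough Python sketch of the filter, including a fallback to requestedHostname when the name field is empty. The response is a trimmed stand-in for the /v3/nodes output: the first node uses values from the output above, while the second node is entirely hypothetical and only there to show that the filter skips other clusters.

```python
# Trimmed stand-in for the /v3/nodes response.
response = {
    "data": [
        {   # values copied from the API output above
            "clusterId": "c-zs42v",
            "id": "c-zs42v:m-a4c0d00d69b6",
            "name": "",
            "requestedHostname": "xyz-node1",
            "ipAddress": "10.10.204.124",
            "state": "registering",
        },
        {   # hypothetical healthy node, for contrast only
            "clusterId": "c-6p529",
            "id": "c-6p529:m-000000000000",
            "name": "example-node",
            "requestedHostname": "example-node",
            "ipAddress": "192.0.2.10",
            "state": "active",
        },
    ]
}


def nodes_in_cluster(response, cluster_id):
    """Equivalent of the jq filter: select(.clusterId == cluster_id)."""
    return [n for n in response["data"] if n.get("clusterId") == cluster_id]


def describe(node):
    """One-line summary; falls back to requestedHostname if name is empty."""
    label = node.get("name") or node.get("requestedHostname") or node["id"]
    return f'{label} ({node["ipAddress"]}) id={node["id"]} state={node["state"]}'


for node in nodes_in_cluster(response, "c-zs42v"):
    print(describe(node))
```

Falling back to requestedHostname is a choice made for this sketch, not something check_rancher2 is known to do.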

The API output also shows a specific API URL (links) to access this specific node. By accessing the URL (while already being logged in to the Rancher 2 UI), the same output can be seen in the browser. In addition to the JSON output, multiple operations, including delete, can be triggered on the right side.

Delete a Kubernetes node in Rancher 2 API

This opens an "API Request" layer where the resulting API request is shown as a curl command. But it can also be executed directly by clicking on the [Send Request] button:

As soon as this was done, the node was finally (properly) deleted from the API.
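For those who prefer a script over clicking in the API browser, here is a minimal sketch of how such a request could be prepared in Python. It assumes that the "remove" link from the node output accepts an HTTP DELETE and that the API token works with HTTP basic authentication, as in the curl call used earlier. The function name is made up for this example, and the request is only built, not sent.

```python
import base64
import urllib.request


def build_delete_request(remove_url, access_key, secret_key):
    """Prepare (but do not send) an HTTP DELETE for a node's "remove" link."""
    credentials = f"{access_key}:{secret_key}".encode()
    request = urllib.request.Request(remove_url, method="DELETE")
    # Same HTTP basic authentication that "curl -u" uses.
    request.add_header("Authorization",
                       "Basic " + base64.b64encode(credentials).decode())
    request.add_header("Accept", "application/json")
    return request


request = build_delete_request(
    "https://rancher.example.com/v3/nodes/c-zs42v:m-a4c0d00d69b6",
    "token-xxxxx", "secret")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Keeping the send step separate makes it easy to double-check the node ID in the URL before anything is deleted.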

The same check_rancher2 node check now returns OK:

$ ./check_rancher2.sh -H rancher.example.com -U token-xxxxx -P "secret" -S -t node
CHECK_RANCHER2 OK - All 66 nodes are active|'nodes_total'=66;;;; 'node_errors'=0;;;; 'node_ignored'=0;;;;

Rancher 2 monitoring shows all Kubernetes nodes OK

Looking for a managed dedicated Kubernetes environment?

Although currently called "the de-facto container infrastructure", Kubernetes is anything but easy. The complexity adds additional problems and considerations. We at Infiniroot love to share our troubleshooting knowledge when we need to tackle certain issues - but we also know this is not for everyone ("it just needs to work"). So if you are looking for a managed and dedicated Kubernetes environment, managed by Rancher 2, with server location Switzerland or even in your own on-premise datacenter, check out our Private Kubernetes Container Cloud Infrastructure service at Infiniroot.

Infiniroot Blog: We sometimes write, too.
