In this guide, we walk through troubleshooting and resolving worker node failures in a Kubernetes cluster. Before you begin, ensure that your lab environment is properly set up.
Below is a step-by-step procedure to diagnose and fix issues on a worker node (node01).
The kubelet configuration file (/var/lib/kubelet/config.yaml) points to the wrong CA certificate. Identify the correct CA file (for example, /etc/kubernetes/pki/ca.crt) and update the configuration accordingly.
After making the change, restart the kubelet service:
root@node01:~# service kubelet restart
Then, confirm the service is active:
root@node01:~# service kubelet status
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Fri 2022-04-22 23:06:14 UTC; 5s ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 20357 (kubelet)
Return to the control plane and check that both nodes are now Ready:
root@controlplane:~# kubectl get nodes
NAME           STATUS   ROLES                  AGE   VERSION
controlplane   Ready    control-plane,master   43m   v1.20.0
node01         Ready    <none>                 43m   v1.20.0
Always back up configuration files before making any changes.
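As a minimal sketch of that advice applied to the CA fix above, the helper below copies the file aside before rewriting its clientCAFile line. The function name set_client_ca and the .bak suffix are illustrative assumptions, not part of the lab. Pass whichever CA path actually exists on your node (kubeadm clusters typically use /etc/kubernetes/pki/ca.crt).

```shell
# set_client_ca CONFIG CA_PATH
# Hypothetical helper: back up CONFIG, then point its clientCAFile at CA_PATH.
set_client_ca() {
  # Keep an untouched copy next to the original before changing anything.
  cp -p "$1" "$1.bak" || return 1
  # Rewrite only the clientCAFile line, preserving its indentation.
  # Reading from the backup and writing into the original keeps its permissions.
  sed "s|^\([[:space:]]*clientCAFile:\).*|\1 $2|" "$1.bak" > "$1"
}

# Example, run as root on node01:
#   set_client_ca /var/lib/kubelet/config.yaml /etc/kubernetes/pki/ca.crt
#   service kubelet restart
```

If the change makes things worse, copying the .bak file back over the original restores the previous state.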
Step 6: Fix the Incorrect Control Plane Port in kubelet.conf
If node01 goes NotReady again even after the configuration fix, inspect the kubelet logs on node01. You might see an error like:
failed to ensure lease exists, will retry in 7s, error: Get "http://10.54.130.2:6553/api/v1/namespaces/kube-node-lease/leases/node01?...": dial tcp 10.54.130.2:6553: connect: connection refused
This indicates that the kubelet is attempting to connect to the control plane on an incorrect port (6553); the Kubernetes API server listens on port 6443 by default. To resolve this, inspect the kubeconfig file that the kubelet uses to reach the control plane:
root@node01:~# ls /etc/kubernetes/kubelet.conf
root@node01:~# cat /etc/kubernetes/kubelet.conf
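When reading the file, the line that matters is the cluster's server: entry. As a rough sketch, the helper below pulls the port out of that line and compares it with an expected value. The function names are hypothetical, and 6443 is assumed only because it is the API server's default secure port; confirm the real port on your control plane before editing anything.

```shell
# server_port KUBECONFIG
# Print the port from the first "server:" line, e.g. https://host:6553 -> 6553.
server_port() {
  sed -n 's|^[[:space:]]*server:[[:space:]]*[a-z]*://[^:/]*:\([0-9]*\).*|\1|p' "$1" | head -n 1
}

# check_port KUBECONFIG EXPECTED_PORT
# Report whether the configured port matches the expected one.
check_port() {
  found=$(server_port "$1")
  if [ "$found" = "$2" ]; then
    echo "ok: port $found"
  else
    echo "mismatch: found ${found:-none}, expected $2"
    return 1
  fi
}

# Example, run on node01 (6443 is an assumed default, not taken from the lab):
#   check_port /etc/kubernetes/kubelet.conf 6443
```

After correcting the port in the file, restart the kubelet as in the previous step so it picks up the change.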
When troubleshooting worker node failures in a Kubernetes cluster, follow these steps:
Check the node status using kubectl get nodes.
SSH into the affected worker node and verify that the kubelet service is running.
If the kubelet service fails, review the logs using journalctl -u kubelet to identify any misconfigurations.
In this example, the issues included:
An incorrect client CA file in /var/lib/kubelet/config.yaml
An incorrect control plane port in /etc/kubernetes/kubelet.conf
Correct the misconfigurations and restart the kubelet service, then verify that the node status returns to Ready.
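The first step of this checklist can be scripted. As a small sketch, the filter below reads kubectl get nodes output on standard input and prints the name of every node whose STATUS does not start with Ready. The function name is illustrative, and it assumes the default column layout (NAME first, STATUS second).

```shell
# not_ready_nodes
# Read `kubectl get nodes` output on stdin; print nodes that are not Ready.
not_ready_nodes() {
  # Skip the header row, then test the STATUS column (second field).
  # "Ready,SchedulingDisabled" still counts as Ready; "NotReady" does not.
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Example, run on the control plane:
#   kubectl get nodes | not_ready_nodes
```

An empty result means every node reports Ready; any name printed is a candidate for the SSH and journalctl steps.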
Following this systematic approach will help you quickly pinpoint and resolve issues during daily operations in your Kubernetes cluster. Happy troubleshooting!
For further reading: