Demo Standard K8s Issues and Explore MCP Tools

Hello everyone, and welcome to the first lab. In this hands-on lesson we will reproduce two common Kubernetes cluster issues, troubleshoot them manually, and then repeat the same analysis using KAgent (an AI-assisted Kubernetes agent). The goal is to show how correlating diagnostics (kubectl, events, metrics, quotas) leads to the root cause, and how AI can speed up discovery and remediation.

Task summary

Problem	Namespace	Symptom
`order-api` Service not routing traffic	`default`	Service exists but has no endpoints; curl to NodePort fails
`inventory-service` HPA not scaling to `minReplicas`	`backend-apps`	HPA shows `minReplicas=3` but only 1 pod runs; events show pod creation failures

Follow a reproducible troubleshooting pattern: observe symptoms, gather cluster state (svc/pods/endpoints/events/metrics/quotas), link evidence to possible causes, and apply the smallest safe remediation. Use kubectl + events + metrics to avoid misdirection.

Task 1 — Order API: Service exists but traffic not routed

Reproduce symptom:

From the host try the NodePort:

curl http://localhost:30081

(Initially this will fail because the Service has no endpoints.)

Gather basic cluster state in the default namespace:

kubectl get svc -n default
kubectl get pods -n default --show-labels
kubectl get endpoints -n default

You should see the order-api Service present but with empty endpoints. Inspect the Service manifest (for example with kubectl get svc order-api -n default -o yaml) to check its selector:

metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"order-api","namespace":"default"},"spec":{"ports":[{"nodePort":30081,"port":80,"targetPort":80}],"selector":{"app":"order-api","version":"v1"},"type":"NodePort"}}
  creationTimestamp: "2025-12-16T10:29:13Z"
  name: order-api
  namespace: default
spec:
  clusterIP: 10.43.187.240
  ports:
  - nodePort: 30081
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: order-api
    version: v1
  type: NodePort

Diagnosis:

Running pods show label app=order-api but version=v2 (pod template uses v2).
The Service selector is version=v1. This selector mismatch means the Service selects no pods → no endpoints → traffic not routed.
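The matching rule behind this diagnosis can be sketched in a few lines of Python. This is a simplified model for illustration, not Kubernetes source code: it assumes a non-empty equality-based selector, and the function name `selects` is made up for this example. A Service selects a pod only when every key/value pair in its selector is present in the pod's labels; extra pod labels do not matter.

```python
def selects(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """Return True if every selector key/value pair appears in the pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Values taken from the Service manifest and the pod labels described above.
service_selector = {"app": "order-api", "version": "v1"}
pod_labels = {"app": "order-api", "version": "v2"}

print(selects(service_selector, pod_labels))                       # False
print(selects({"app": "order-api", "version": "v2"}, pod_labels))  # True
```

Because the `version` value differs, the first check fails even though `app` matches, which is why the endpoints list stays empty.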

Fix options:

Edit the Service selector to match pods, or update the Deployment labels to match the Service.
To change the Service selector to v2 (example using kubectl patch):

kubectl patch svc order-api -n default --type='json' -p='[{"op":"replace","path":"/spec/selector/version","value":"v2"}]'

Alternatively, edit the Service interactively:

kubectl edit svc order-api -n default
# change spec.selector.version: v1 -> v2
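To make the patch payload less opaque, here is a minimal Python sketch of what a single JSON Patch `replace` operation does to the Service object. It is an illustration under simplifying assumptions (object keys only, no array indices), and the helper name `json_patch_replace` is hypothetical; kubectl and the API server use their own implementation.

```python
import copy


def json_patch_replace(doc: dict, path: str, value):
    """Apply one JSON Patch 'replace' op. Handles object keys only, not array indices."""
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    # JSON Pointer: split on '/', then unescape '~1' -> '/' and '~0' -> '~'.
    keys = [k.replace("~1", "/").replace("~0", "~") for k in path[1:].split("/")]
    patched = copy.deepcopy(doc)  # leave the input untouched
    parent = patched
    for key in keys[:-1]:
        parent = parent[key]
    if keys[-1] not in parent:
        # 'replace' requires the target location to exist already.
        raise KeyError(path)
    parent[keys[-1]] = value
    return patched


# Trimmed-down Service object with the selector from the manifest above.
service = {"spec": {"selector": {"app": "order-api", "version": "v1"}}}
patched = json_patch_replace(service, "/spec/selector/version", "v2")
print(patched["spec"]["selector"])  # {'app': 'order-api', 'version': 'v2'}
```

Only the `version` key changes; the `app` key and the rest of the Service spec are left as they were.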

Verify:

kubectl get endpoints -n default
curl http://localhost:30081

After the fix, the Service endpoints are populated and curl returns the application page (or the default nginx page if the pods run an nginx image).

Root cause: label/selector mismatch (version=v1 vs version=v2) between the Service and the Deployment pods.

Task 2 — HPA not reaching minReplicas in `backend-apps`

Scenario: An HPA is configured for inventory-service with:

CPU target: 80%
minReplicas: 3
maxReplicas: 10
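As a rough illustration of what these three settings mean, the sketch below applies the ratio formula from the Kubernetes HPA documentation and then clamps the result to the min/max range. It is deliberately simplified: it ignores the tolerance band, stabilization windows, and pod readiness handling. The function name and the sample replica counts and utilization values are assumptions for this example; only the 80% target and the 3/10 bounds come from the scenario.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified HPA calculation: scale by the utilization ratio, then clamp."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# Idle workload: the ratio suggests 0 replicas, but minReplicas acts as a floor.
print(desired_replicas(1, 0, 80, 3, 10))    # 3
# 4 replicas at 160% of requested CPU: the ratio doubles the replica count.
print(desired_replicas(4, 160, 80, 3, 10))  # 8
# 8 replicas at 160%: the ratio suggests 16, capped by maxReplicas.
print(desired_replicas(8, 160, 80, 3, 10))  # 10
```

The first case mirrors this lab: with CPU at 0%, the floor of 3 is what the Deployment should be scaled to, so a single running pod points to something blocking pod creation rather than to the scaling calculation.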

Steps to investigate:

Inspect HPA and pods:

kubectl get hpa -n backend-apps
kubectl get pods -n backend-apps

Example HPA output:

NAME                   REFERENCE                       TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
inventory-service      Deployment/inventory-service    cpu: 0%/80%     3         10        3          22m

However, kubectl get pods shows only 1 pod running, even though the HPA reports 3 replicas.

Describe the HPA to view events/conditions:

kubectl describe hpa -n backend-apps

You may see events like:

Warning  FailedGetResourceMetric      failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Warning  FailedComputeMetricsReplicas invalid metrics (1 invalid out of 1), first error is: failed to get cpu resource metric value

Check cluster metrics availability:

kubectl top nodes

If kubectl top nodes returns values, metrics-server is serving node metrics. The HPA can still fail to compute CPU utilization if pod-level metrics are missing (check with kubectl top pods -n backend-apps) or if metrics-server has other issues.

Inspect namespace events for pod creation failures:

kubectl get events -n backend-apps --sort-by='.lastTimestamp'

Look for events such as Pod ... Forbidden: exceeded quota or FailedScheduling.

Check ResourceQuota in the namespace:

kubectl get resourcequota -n backend-apps

Example output:

NAME               REQUEST                                                        AGE   LIMIT
compute-quota      pods: 1/1, requests.cpu: 100m/2, requests.memory: 128Mi/2903Mi    25m   limits.cpu: 200m/2, limits.memory: 256Mi/2903Mi

Here the quota caps the namespace at one pod (pods: 1/1 means 1 used out of 1 allowed), so the Deployment's ReplicaSet cannot create the additional pods needed to reach minReplicas=3.

Fix:

Update the ResourceQuota to allow more pods (for example pods: 10) or adjust quota to match expected application scale.

kubectl edit resourcequota compute-quota -n backend-apps
# change pods: 1 to pods: 10 (or apply an updated manifest)
kubectl rollout restart deployment inventory-service -n backend-apps

After increasing the quota, the cluster can create additional pods and the HPA should converge to the desired replicas.
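The quota arithmetic behind this fix can be checked with a short Python sketch. It assumes that the single running pod accounts for all current usage in the example quota output and that every replica requests the same resources; the function name and the dictionary keys (with CPU in millicores and memory in Mi) are invented for this illustration.

```python
# Per-pod usage and hard limits read from the example quota output (one pod running).
per_pod = {"pods": 1, "requests.cpu_m": 100, "requests.memory_mi": 128,
           "limits.cpu_m": 200, "limits.memory_mi": 256}
hard = {"pods": 1, "requests.cpu_m": 2000, "requests.memory_mi": 2903,
        "limits.cpu_m": 2000, "limits.memory_mi": 2903}


def blocked_resources(replicas: int, per_pod: dict, hard: dict) -> list[str]:
    """Return the quota entries that would be exceeded at the given replica count."""
    return [name for name, amount in per_pod.items()
            if replicas * amount > hard[name]]


# With the original quota, only the pod count stands in the way of 3 replicas.
print(blocked_resources(3, per_pod, hard))                  # ['pods']
# Raising the pod limit to 10 removes the blocker; CPU and memory already fit.
print(blocked_resources(3, per_pod, {**hard, "pods": 10}))  # []
```

Under these assumptions, three replicas need only 300m CPU and 384Mi memory in requests, well inside the quota, which supports the conclusion that the pod count is the decisive limit.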

Verify:

kubectl get pods -n backend-apps
kubectl get events -n backend-apps --sort-by='.lastTimestamp'
kubectl describe hpa -n backend-apps

Important diagnosis notes:

The HPA warnings about metrics indicate a metrics failure path that should be investigated (metrics-server, kubelet metrics, scraping), but the immediate blocker preventing scale-up was the ResourceQuota restricting pods to 1. Always correlate events + quotas + metrics to determine the decisive cause.

Do not increase namespace quotas indiscriminately in production. Align quota changes with capacity planning and organizational policies. If unsure, request approval or test changes in a non-production environment first.

Manual troubleshooting summary

Issue	Evidence	Fix
`order-api` Service no endpoints	Service selector `version=v1` vs pods `version=v2`	Patch Service selector to `v2` or reconcile Deployment labels
`inventory-service` HPA not scaling	HPA shows metric errors; events show quota-exceeded pod creation failures; ResourceQuota shows `pods: 1/1`	Increase namespace `ResourceQuota` pods limit and re-rollout deployment

Troubleshooting tip: Combine kubectl get/describe, events, metrics (kubectl top), and namespace quotas to build the causal chain.

KAgent: AI-powered Kubernetes troubleshooting

KAgent exposes a chat-style UI and executes a defined toolset (get resources, describe, get YAML, apply manifests, etc.). When given a natural language prompt such as: “Why isn’t the service order-api in namespace default routing?” the agent runs investigative commands, aggregates the results, and sends them to an LLM for analysis. The agent can iterate (requesting additional data) until it provides a prioritized diagnosis and actionable remediation.

Example metadata the agent uses to identify resources:

{
  "namespace": "default",
  "resource_name": "order-api",
  "resource_type": "service"
}
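The investigate-and-iterate flow described above can be sketched as a small loop. This is a conceptual illustration only and does not reflect KAgent's actual API: `TOOLS`, `analyze`, `investigate`, and the two stub tool functions are hypothetical names, the tools return canned data instead of querying a cluster, and a rule-based stub stands in for the LLM call.

```python
# Hypothetical stand-ins for cluster tools; a real agent would query the Kubernetes API.
def get_service_selector(target: dict) -> dict:
    return {"app": "order-api", "version": "v1"}


def get_pod_labels(target: dict) -> list[dict]:
    return [{"app": "order-api", "version": "v2"}]


TOOLS = {"service_selector": get_service_selector, "pod_labels": get_pod_labels}


def analyze(evidence: dict) -> dict:
    """Stub for the LLM step: request another tool or return a diagnosis."""
    for tool_name in ("service_selector", "pod_labels"):
        if tool_name not in evidence:
            return {"next_tool": tool_name}
    selector = evidence["service_selector"]
    matching = [labels for labels in evidence["pod_labels"]
                if all(labels.get(k) == v for k, v in selector.items())]
    if not matching:
        return {"diagnosis": "selector matches no pods"}
    return {"diagnosis": "selector matches at least one pod"}


def investigate(target: dict, max_steps: int = 5):
    """Gather evidence one tool call at a time until the analysis step concludes."""
    evidence = {}
    for _ in range(max_steps):
        decision = analyze(evidence)
        if "diagnosis" in decision:
            return decision["diagnosis"], evidence
        tool_name = decision["next_tool"]
        evidence[tool_name] = TOOLS[tool_name](target)
    return "inconclusive", evidence


target = {"namespace": "default", "resource_name": "order-api", "resource_type": "service"}
diagnosis, evidence = investigate(target)
print(diagnosis)  # selector matches no pods
```

The `max_steps` bound is a design choice in this sketch: it keeps the loop from running indefinitely if the analysis step never reaches a conclusion.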

The agent detects the selector mismatch (version=v1 vs pods version=v2), recommends updating the Service selector to version=v2, and can apply that change automatically (via a manifest or kubectl apply) and then re-check endpoints.

Screenshot: the kagent/k8s-agent chat interface reporting that the service "order-api" now uses the selector app=order-api, version=v2. The right sidebar lists the available k8s tools and agent details.

KAgent follows a similar investigation for the HPA + inventory-service: describe HPA, check events, inspect ReplicaSet/Deployment, and verify ResourceQuota. It produces a prioritized list of causes (metrics-server issues vs quota) and recommends changing the ResourceQuota to allow more pods as the decisive remediation.

Screenshot: the kagent/k8s-agent chat UI showing the agent's troubleshooting steps for a Kubernetes HPA that is not scaling. The right sidebar lists k8s tools/commands and the left shows the chat list.

Advantages of using KAgent

Natural language triage: describe the issue in plain English and let the agent gather evidence.
Faster discovery: consistent, repeatable investigative flow reduces manual rework.
Evidence surfacing: shows exactly which commands and outputs lead to a conclusion.
Improved MTTR and operational consistency across teams.

Closing and next steps

In this lesson we:

Manually diagnosed two common Kubernetes problems: a Service selector mismatch and a namespace ResourceQuota preventing HPA scaling.
Demonstrated how KAgent can automate the same investigative steps and recommend or perform safe remediation.

Next lessons will cover:

Building custom agents and configuring LLM providers
Defining a safe toolset and guardrails for automated remediation
Instrumenting clusters to improve metrics and observability
