A pod crashes at 3 AM. You SSH into the cluster, run kubectl describe pod, scroll through 200 lines of events, cross-reference the container logs, check resource limits, and eventually figure out it was an OOM kill from a memory leak in a sidecar. That took 20 minutes. An AI can do it in 2 seconds.
This tutorial builds a Python CLI tool that collects Kubernetes diagnostic data and sends it to Claude for instant root cause analysis. It handles CrashLoopBackOff, ImagePullBackOff, OOMKilled, pending pods, and connection failures — the five errors that account for roughly 80% of K8s incidents. The whole thing runs through EzAI API, so you get access to Claude at a fraction of the direct cost.
How the K8s Troubleshooter Works
The architecture is straightforward: collect cluster state, build a diagnostic prompt, send it to Claude, and format the response into actionable steps. No agents, no chains — just well-structured kubectl output piped into a smart model.
kubectl output → error classification → Claude analysis → fix suggestions with runbook commands
The tool runs three kubectl commands in parallel — describe pod, logs, and get events — merges the output, classifies the error type, then constructs a targeted prompt. Claude doesn't get a generic "fix my pod" request. It gets structured diagnostic data with the error category already identified, which dramatically improves the quality of the analysis.
Setting Up the Project
You need Python 3.10+, kubectl configured with cluster access, and an EzAI API key. Install the dependencies:
```shell
pip install anthropic rich
export EZAI_API_KEY="sk-your-key-here"
```
Collecting Kubernetes Diagnostics
The first module handles kubectl calls. Running them sequentially is slow — a busy cluster might take 3-4 seconds per command. We use asyncio.subprocess to fire all three in parallel and merge results:
```python
import asyncio
import json

async def run_kubectl(cmd: str) -> str:
    """Run a kubectl command and return stdout."""
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    if proc.returncode != 0:
        return f"ERROR: {stderr.decode().strip()}"
    return stdout.decode().strip()

async def collect_diagnostics(namespace: str, pod: str) -> dict:
    """Gather pod status, events, and logs in parallel."""
    describe, logs, events = await asyncio.gather(
        run_kubectl(f"kubectl describe pod {pod} -n {namespace}"),
        run_kubectl(f"kubectl logs {pod} -n {namespace} --tail=100"),
        run_kubectl(
            f"kubectl get events -n {namespace} "
            f"--field-selector involvedObject.name={pod} "
            f"--sort-by=.lastTimestamp"
        ),
    )
    return {
        "describe": describe,
        "logs": logs,
        "events": events,
    }
```
Three concurrent subprocesses, all finishing within the slowest one's latency. The --tail=100 flag on logs prevents pulling megabytes of output on chatty containers — Claude doesn't need your entire access log to diagnose a segfault.
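You can see the win with a stand-in for `run_kubectl`: a hypothetical `fake_kubectl` that shells out to `sleep` (POSIX shell assumed, not part of the tool). Three 0.3-second commands gathered together finish in roughly 0.3 seconds of wall time, not 0.9:

```python
import asyncio
import time

async def fake_kubectl(delay: float) -> str:
    """Stand-in for run_kubectl: a shell command that takes `delay` seconds."""
    proc = await asyncio.create_subprocess_shell(
        f"sleep {delay} && echo done",
        stdout=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()
    return out.decode().strip()

async def demo() -> tuple[list[str], float]:
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_kubectl(0.3), fake_kubectl(0.3), fake_kubectl(0.3)
    )
    return list(results), time.perf_counter() - start

results, elapsed = asyncio.run(demo())
print(results, f"{elapsed:.2f}s")  # three results in ~0.3s, not ~0.9s
```

The same shape scales to more commands: `asyncio.gather` returns results in argument order, so the unpacking in `collect_diagnostics` stays predictable.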
Classifying the Error
Before sending anything to the AI, we classify the error locally. This lets us craft a targeted system prompt and keeps token usage low by only including relevant context:
```python
ERROR_PATTERNS = {
    "CrashLoopBackOff": ["CrashLoopBackOff", "back-off restarting"],
    "ImagePullBackOff": ["ImagePullBackOff", "ErrImagePull"],
    "OOMKilled": ["OOMKilled", "memory limit"],
    "Pending": ["Pending", "Unschedulable", "insufficient"],
    "ConnectionRefused": ["connection refused", "no such host"],
}

def classify_error(diagnostics: dict) -> str:
    """Match kubectl output against known error patterns."""
    combined = "\n".join(diagnostics.values()).lower()
    for category, patterns in ERROR_PATTERNS.items():
        if any(p.lower() in combined for p in patterns):
            return category
    return "Unknown"
```
Simple string matching works here because Kubernetes error messages are standardized. CrashLoopBackOff is always CrashLoopBackOff — there's no ambiguity. Save the AI tokens for actual reasoning, not pattern matching you can do with Python's `in` operator.
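A quick sanity check confirms the matcher picks the right category. The pattern table and function are repeated inline (trimmed to three categories) so the snippet runs standalone, and the input strings are hypothetical kubectl fragments:

```python
# Standalone sanity check for the classifier (pattern table trimmed and
# repeated inline; input strings are made-up kubectl fragments).
ERROR_PATTERNS = {
    "CrashLoopBackOff": ["CrashLoopBackOff", "back-off restarting"],
    "ImagePullBackOff": ["ImagePullBackOff", "ErrImagePull"],
    "OOMKilled": ["OOMKilled", "memory limit"],
}

def classify_error(diagnostics: dict) -> str:
    combined = "\n".join(diagnostics.values()).lower()
    for category, patterns in ERROR_PATTERNS.items():
        if any(p.lower() in combined for p in patterns):
            return category
    return "Unknown"

# Matching is case-insensitive, so event casing doesn't matter.
print(classify_error({"events": "Back-off restarting failed container"}))
# → CrashLoopBackOff
print(classify_error({"logs": "Failed to pull image: ErrImagePull"}))
# → ImagePullBackOff
print(classify_error({"logs": "everything is fine"}))
# → Unknown
```

Dict order matters: the first category with a hit wins, so put the more specific patterns earlier if you extend the table.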
Analyzing Pod Failures with AI
Now the core: sending structured diagnostics to Claude via EzAI API and getting back a diagnosis with concrete fix commands. The system prompt is tailored to the detected error category:
```python
import os
import anthropic

client = anthropic.Anthropic(
    api_key=os.environ["EZAI_API_KEY"],
    base_url="https://ezaiapi.com",
)

SYSTEM_PROMPTS = {
    "CrashLoopBackOff": """You are a Kubernetes SRE expert. The pod is in CrashLoopBackOff.
Analyze the describe output and container logs to find:
1. The exit code and signal (OOM=137, segfault=139, app error=1)
2. Whether it's a config issue (missing env vars, bad mounts)
3. Whether it's an application bug (stack trace in logs)
Provide the root cause, then exact kubectl/yaml commands to fix it.""",
    "OOMKilled": """You are a Kubernetes SRE expert. The pod was OOMKilled.
Analyze memory limits vs actual usage. Check for:
1. Memory leaks (steadily increasing RSS in logs)
2. Undersized limits (legitimate usage exceeding limit)
3. JVM/runtime overhead not accounted for
Recommend specific memory limit values based on the data.""",
    # ... similar prompts for the other categories
}

def analyze_pod(diagnostics: dict, error_type: str) -> str:
    """Send diagnostics to Claude and get analysis."""
    system = SYSTEM_PROMPTS.get(
        error_type,
        "You are a Kubernetes SRE. Diagnose this pod failure.",
    )
    user_msg = f"""Error type: {error_type}

=== POD DESCRIBE ===
{diagnostics['describe'][:4000]}

=== CONTAINER LOGS (last 100 lines) ===
{diagnostics['logs'][:3000]}

=== RECENT EVENTS ===
{diagnostics['events'][:2000]}"""

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1500,
        system=system,
        messages=[{"role": "user", "content": user_msg}],
    )
    return response.content[0].text
```
We truncate each kubectl output to stay within token limits. Describe output caps at 4,000 chars, logs at 3,000, events at 2,000 — roughly 2,500 input tokens total. With Claude Sonnet 4.5 through EzAI, that analysis costs about $0.003 per diagnosis. Even running it 100 times a day, you're looking at 30 cents.
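The arithmetic behind those numbers, using the approximate $0.003 per-diagnosis figure above, is easy to sanity-check:

```python
# Back-of-envelope cost check using the ~$0.003/diagnosis figure
# (approximate; verify against current EzAI pricing).
cost_per_diagnosis = 0.003  # USD, Claude Sonnet 4.5 via EzAI
runs_per_day = 100

daily = cost_per_diagnosis * runs_per_day
monthly = daily * 30
print(f"${daily:.2f}/day, ${monthly:.2f}/month")  # → $0.30/day, $9.00/month
```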
The model choice matters: Sonnet is fast enough for on-call use (under 3 seconds) and accurate enough to catch subtle issues like "your liveness probe has a 1-second timeout but your app takes 2 seconds to start." If you need deeper analysis on complex multi-container pods, swap to claude-opus-4 — still cheap through EzAI.
Building the CLI
Wrap everything in a clean CLI using Rich for formatted terminal output. The full script scans a namespace, identifies unhealthy pods, and diagnoses each one:
```python
import sys
import asyncio
from rich.console import Console
from rich.panel import Panel
from rich.markdown import Markdown

console = Console()

async def get_unhealthy_pods(namespace: str) -> list[str]:
    """Find pods not in Running/Completed state."""
    output = await run_kubectl(
        f"kubectl get pods -n {namespace} -o json"
    )
    pods = json.loads(output)
    return [
        p["metadata"]["name"]
        for p in pods["items"]
        if p["status"]["phase"] not in ("Running", "Succeeded")
    ]

async def main():
    ns = sys.argv[1] if len(sys.argv) > 1 else "default"
    console.print(f"\n🔍 Scanning namespace [bold]{ns}[/bold]...")
    sick_pods = await get_unhealthy_pods(ns)
    if not sick_pods:
        console.print("[green]✓ All pods healthy![/green]")
        return
    console.print(f"[red]Found {len(sick_pods)} unhealthy pod(s)[/red]\n")
    for pod in sick_pods:
        console.print(f"📋 Diagnosing [bold]{pod}[/bold]...")
        diag = await collect_diagnostics(ns, pod)
        err = classify_error(diag)
        analysis = analyze_pod(diag, err)
        console.print(Panel(
            Markdown(analysis),
            title=f"[bold]{pod}[/bold] — {err}",
            border_style="red" if err != "Unknown" else "yellow",
        ))

if __name__ == "__main__":
    asyncio.run(main())
```
Run it with python k8s_troubleshooter.py production and you get Rich-formatted panels with markdown diagnosis for each broken pod. The output includes root cause, explanation, and copy-pasteable kubectl commands to fix the issue.
Production Hardening
The basic version works, but production on-call tooling needs a few more things. Here's what to add before you trust it with your 3 AM pages:
- Caching: Hash the diagnostic output and cache AI responses for 5 minutes. Same pod, same error, same response — no point burning tokens on duplicate analysis when the pod restarts every 30 seconds in CrashLoopBackOff.
- Namespace scanning: Add a `--all-namespaces` flag that iterates over every namespace. Filter out `kube-system` pods that are intentionally not running (like completed Jobs).
- Slack/PagerDuty integration: Pipe the AI analysis into your alerting pipeline. When a pod fails, the on-call engineer gets the diagnosis in the alert, not just "pod crashed."
- Cost tracking: Log `response.usage.input_tokens` and `response.usage.output_tokens` from each call. With EzAI's transparent pricing, you can predict monthly spend accurately.
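The caching piece is only a few lines in practice. A minimal in-memory sketch, assuming a single long-running process (the `analyze_fn` parameter is a hypothetical stand-in for `analyze_pod` so the snippet doesn't depend on the API client):

```python
import hashlib
import time

# key -> (insert time, cached AI response); in-memory, single process only
_cache: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 300  # 5 minutes

def cached_analysis(diagnostics: dict, error_type: str, analyze_fn) -> str:
    """Return a cached response for identical diagnostics, else call the AI."""
    # Hash the error type plus all diagnostic text into a stable key.
    key = hashlib.sha256(
        (error_type + "\n".join(sorted(diagnostics.values()))).encode()
    ).hexdigest()
    now = time.monotonic()
    if key in _cache and now - _cache[key][0] < TTL_SECONDS:
        return _cache[key][1]  # cache hit: no tokens burned
    result = analyze_fn(diagnostics, error_type)
    _cache[key] = (now, result)
    return result
```

A pod restarting every 30 seconds in CrashLoopBackOff now costs one API call per 5-minute window instead of ten.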
Switching Models for Different Scenarios
One of the advantages of routing through EzAI is easy model switching. You can adjust the model based on severity:
```python
def pick_model(error_type: str, pod_count: int) -> str:
    """Use cheaper models for simple errors, heavy models for complex ones."""
    if error_type in ("ImagePullBackOff", "Pending"):
        return "claude-haiku-3-5"  # fast + cheap for obvious fixes
    if pod_count > 5:
        return "claude-haiku-3-5"  # keep costs down for batch scans
    return "claude-sonnet-4-5"  # deeper analysis for tricky errors
```
ImagePullBackOff is almost always a typo in the image tag or missing registry credentials. Haiku can diagnose that for $0.0003 per call — a tenth of a cent. Reserve Sonnet for the cases where the AI actually needs to reason through container interactions and timing issues.
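A quick standalone check of the routing (function repeated inline so the snippet runs on its own) shows each branch firing:

```python
# Routing check: simple errors and batch scans go to Haiku,
# everything else gets Sonnet.
def pick_model(error_type: str, pod_count: int) -> str:
    if error_type in ("ImagePullBackOff", "Pending"):
        return "claude-haiku-3-5"  # obvious fixes
    if pod_count > 5:
        return "claude-haiku-3-5"  # batch scans
    return "claude-sonnet-4-5"     # tricky errors

print(pick_model("ImagePullBackOff", 1))   # → claude-haiku-3-5
print(pick_model("CrashLoopBackOff", 2))   # → claude-sonnet-4-5
print(pick_model("CrashLoopBackOff", 10))  # → claude-haiku-3-5
```

To wire it into the CLI, pass the result as the `model` argument to `client.messages.create` instead of the hardcoded string (which means threading the pod count through to `analyze_pod`).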
What's Next
You've got a working K8s troubleshooter that turns 20-minute debugging sessions into 3-second diagnoses. Some directions to take it from here:
- Add Helm release analysis — pull `helm history` and diff the last two releases when a deployment fails after an upgrade
- Build a watch mode that monitors pods continuously and alerts when new failures appear, complete with pre-computed diagnosis
- Integrate with the AI incident responder for full automated incident management
- Check out our API docs to explore extended thinking for complex multi-service debugging
The full source code for this project is about 150 lines of Python. No frameworks, no abstractions — just kubectl, the Anthropic SDK, and EzAI handling the model routing. Start with a single namespace, prove it works for your team, then expand.