A pod crashes at 3 AM. You SSH into the cluster, run kubectl describe pod, scroll through 200 lines of events, cross-reference the container logs, check resource limits, and eventually figure out it was an OOM kill from a memory leak in a sidecar. That took 20 minutes. An AI can do it in 2 seconds.
This tutorial builds a Python CLI tool that collects Kubernetes diagnostic data and sends it to Claude for instant root cause analysis. It handles CrashLoopBackOff, ImagePullBackOff, OOMKilled, pending pods, and connection failures — the five errors that account for roughly 80% of K8s incidents. The whole thing runs through EzAI API, so you get access to Claude at a fraction of the direct cost.
How the K8s Troubleshooter Works
The architecture is straightforward: collect cluster state, build a diagnostic prompt, send it to Claude, and format the response into actionable steps. No agents, no chains — just well-structured kubectl output piped into a smart model.
kubectl output → error classification → Claude analysis → fix suggestions with runbook commands
The tool runs three kubectl commands in parallel — describe pod, logs, and get events — merges the output, classifies the error type, then constructs a targeted prompt. Claude doesn't get a generic "fix my pod" request. It gets structured diagnostic data with the error category already identified, which dramatically improves the quality of the analysis.
Setting Up the Project
You need Python 3.10+, kubectl configured with cluster access, and an EzAI API key. Install the dependencies:
```shell
pip install anthropic rich
export EZAI_API_KEY="sk-your-key-here"
```
Collecting Kubernetes Diagnostics
The first module handles kubectl calls. Running them sequentially is slow — a busy cluster might take 3-4 seconds per command. We use asyncio.subprocess to fire all three in parallel and merge results:
```python
import asyncio
import json

async def run_kubectl(cmd: str) -> str:
    """Run a kubectl command and return stdout."""
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    if proc.returncode != 0:
        return f"ERROR: {stderr.decode().strip()}"
    return stdout.decode().strip()

async def collect_diagnostics(namespace: str, pod: str) -> dict:
    """Gather pod status, events, and logs in parallel."""
    describe, logs, events = await asyncio.gather(
        run_kubectl(f"kubectl describe pod {pod} -n {namespace}"),
        run_kubectl(f"kubectl logs {pod} -n {namespace} --tail=100"),
        run_kubectl(
            f"kubectl get events -n {namespace} "
            f"--field-selector involvedObject.name={pod} "
            f"--sort-by=.lastTimestamp"
        ),
    )
    return {
        "describe": describe,
        "logs": logs,
        "events": events,
    }
```
Three concurrent subprocesses, all finishing within the slowest one's latency. The --tail=100 flag on logs prevents pulling megabytes of output on chatty containers — Claude doesn't need your entire access log to diagnose a segfault.
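You can see the win with a stand-in for `run_kubectl`: a hypothetical `fake_kubectl` that shells out to `sleep` (POSIX shell assumed, not part of the tool). Three 0.3-second commands gathered together finish in roughly 0.3 seconds of wall time, not 0.9:

```python
import asyncio
import time

async def fake_kubectl(delay: float) -> str:
    """Stand-in for run_kubectl: a shell command that takes `delay` seconds."""
    proc = await asyncio.create_subprocess_shell(
        f"sleep {delay} && echo done",
        stdout=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()
    return out.decode().strip()

async def demo() -> tuple[list[str], float]:
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_kubectl(0.3), fake_kubectl(0.3), fake_kubectl(0.3)
    )
    return list(results), time.perf_counter() - start

results, elapsed = asyncio.run(demo())
print(results, f"{elapsed:.2f}s")  # three results in ~0.3s, not ~0.9s
```

The same shape scales to more commands: `asyncio.gather` returns results in argument order, so the unpacking in `collect_diagnostics` stays predictable.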
Classifying the Error
Before sending anything to the AI, we classify the error locally. This lets us craft a targeted system prompt and keeps token usage low by only including relevant context:
```python
ERROR_PATTERNS = {
    "CrashLoopBackOff": ["CrashLoopBackOff", "back-off restarting"],
    "ImagePullBackOff": ["ImagePullBackOff", "ErrImagePull"],
    "OOMKilled": ["OOMKilled", "memory limit"],
    "Pending": ["Pending", "Unschedulable", "insufficient"],
    "ConnectionRefused": ["connection refused", "no such host"],
}

def classify_error(diagnostics: dict) -> str:
    """Match kubectl output against known error patterns."""
    combined = "\n".join(diagnostics.values()).lower()
    for category, patterns in ERROR_PATTERNS.items():
        if any(p.lower() in combined for p in patterns):
            return category
    return "Unknown"
```
Simple string matching works here because Kubernetes error messages are standardized. CrashLoopBackOff is always CrashLoopBackOff — there's no ambiguity. Save the AI tokens for actual reasoning, not pattern matching you can do with Python's `in` operator.
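A quick sanity check confirms the matcher picks the right category. The pattern table and function are repeated inline (trimmed to three categories) so the snippet runs standalone, and the input strings are hypothetical kubectl fragments:

```python
# Standalone sanity check for the classifier (pattern table trimmed and
# repeated inline; input strings are made-up kubectl fragments).
ERROR_PATTERNS = {
    "CrashLoopBackOff": ["CrashLoopBackOff", "back-off restarting"],
    "ImagePullBackOff": ["ImagePullBackOff", "ErrImagePull"],
    "OOMKilled": ["OOMKilled", "memory limit"],
}

def classify_error(diagnostics: dict) -> str:
    combined = "\n".join(diagnostics.values()).lower()
    for category, patterns in ERROR_PATTERNS.items():
        if any(p.lower() in combined for p in patterns):
            return category
    return "Unknown"

# Matching is case-insensitive, so event casing doesn't matter.
print(classify_error({"events": "Back-off restarting failed container"}))
# → CrashLoopBackOff
print(classify_error({"logs": "Failed to pull image: ErrImagePull"}))
# → ImagePullBackOff
print(classify_error({"logs": "everything is fine"}))
# → Unknown
```

Dict order matters: the first category with a hit wins, so put the more specific patterns earlier if you extend the table.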
Analyzing Pod Failures with AI
Now the core: sending structured diagnostics to Claude via EzAI API and getting back a diagnosis with concrete fix commands. The system prompt is tailored to the detected error category:
```python
import os
import anthropic

client = anthropic.Anthropic(
    api_key=os.environ["EZAI_API_KEY"],
    base_url="https://ezaiapi.com",
)

SYSTEM_PROMPTS = {
    "CrashLoopBackOff": """You are a Kubernetes SRE expert. The pod is in CrashLoopBackOff.
Analyze the describe output and container logs to find:
1. The exit code and signal (OOM=137, segfault=139, app error=1)
2. Whether it's a config issue (missing env vars, bad mounts)
3. Whether it's an application bug (stack trace in logs)
Provide the root cause, then exact kubectl/yaml commands to fix it.""",
    "OOMKilled": """You are a Kubernetes SRE expert. The pod was OOMKilled.
Analyze memory limits vs actual usage. Check for:
1. Memory leaks (steadily increasing RSS in logs)
2. Undersized limits (legitimate usage exceeding limit)
3. JVM/runtime overhead not accounted for
Recommend specific memory limit values based on the data.""",
    # ... similar prompts for the other categories
}

def analyze_pod(diagnostics: dict, error_type: str) -> str:
    """Send diagnostics to Claude and get analysis."""
    system = SYSTEM_PROMPTS.get(
        error_type,
        "You are a Kubernetes SRE. Diagnose this pod failure.",
    )
    user_msg = f"""Error type: {error_type}

=== POD DESCRIBE ===
{diagnostics['describe'][:4000]}

=== CONTAINER LOGS (last 100 lines) ===
{diagnostics['logs'][:3000]}

=== RECENT EVENTS ===
{diagnostics['events'][:2000]}"""

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1500,
        system=system,
        messages=[{"role": "user", "content": user_msg}],
    )
    return response.content[0].text
```
We truncate each kubectl output to stay within token limits. Describe output caps at 4,000 chars, logs at 3,000, events at 2,000 — roughly 2,500 input tokens total. With Claude Sonnet 4.5 through EzAI, that analysis costs about $0.003 per diagnosis. Even running it 100 times a day, you're looking at 30 cents.
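The arithmetic behind those numbers, using the approximate $0.003 per-diagnosis figure above, is easy to sanity-check:

```python
# Back-of-envelope cost check using the ~$0.003/diagnosis figure
# (approximate; verify against current EzAI pricing).
cost_per_diagnosis = 0.003  # USD, Claude Sonnet 4.5 via EzAI
runs_per_day = 100

daily = cost_per_diagnosis * runs_per_day
monthly = daily * 30
print(f"${daily:.2f}/day, ${monthly:.2f}/month")  # → $0.30/day, $9.00/month
```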
The model choice matters: Sonnet is fast enough for on-call use (under 3 seconds) and accurate enough to catch subtle issues like "your liveness probe has a 1-second timeout but your app takes 2 seconds to start." If you need deeper analysis on complex multi-container pods, swap to claude-opus-4 — still cheap through EzAI.
Building the CLI
Wrap everything in a clean CLI using Rich for formatted terminal output. The full script scans a namespace, identifies unhealthy pods, and diagnoses each one:
```python
import sys
import asyncio
from rich.console import Console
from rich.panel import Panel
from rich.markdown import Markdown

console = Console()

async def get_unhealthy_pods(namespace: str) -> list[str]:
    """Find pods not in Running/Completed state."""
    output = await run_kubectl(
        f"kubectl get pods -n {namespace} -o json"
    )
    pods = json.loads(output)
    return [
        p["metadata"]["name"]
        for p in pods["items"]
        if p["status"]["phase"] not in ("Running", "Succeeded")
    ]

async def main():
    ns = sys.argv[1] if len(sys.argv) > 1 else "default"
    console.print(f"\n🔍 Scanning namespace [bold]{ns}[/bold]...")
    sick_pods = await get_unhealthy_pods(ns)
    if not sick_pods:
        console.print("[green]✓ All pods healthy![/green]")
        return
    console.print(f"[red]Found {len(sick_pods)} unhealthy pod(s)[/red]\n")
    for pod in sick_pods:
        console.print(f"📋 Diagnosing [bold]{pod}[/bold]...")
        diag = await collect_diagnostics(ns, pod)
        err = classify_error(diag)
        analysis = analyze_pod(diag, err)
        console.print(Panel(
            Markdown(analysis),
            title=f"[bold]{pod}[/bold] — {err}",
            border_style="red" if err != "Unknown" else "yellow",
        ))

if __name__ == "__main__":
    asyncio.run(main())
```
Run it with python k8s_troubleshooter.py production and you get Rich-formatted panels with markdown diagnosis for each broken pod. The output includes root cause, explanation, and copy-pasteable kubectl commands to fix the issue.
Production Hardening
The basic version works, but production on-call tooling needs a few more things. Here's what to add before you trust it with your 3 AM pages:
- Caching: Hash the diagnostic output and cache AI responses for 5 minutes. Same pod, same error, same response — no point burning tokens on duplicate analysis when the pod restarts every 30 seconds in CrashLoopBackOff.
- Namespace scanning: Add a `--all-namespaces` flag that iterates over every namespace. Filter out `kube-system` pods that are intentionally not running (like completed Jobs).
- Slack/PagerDuty integration: Pipe the AI analysis into your alerting pipeline. When a pod fails, the on-call engineer gets the diagnosis in the alert, not just "pod crashed."
- Cost tracking: Log `response.usage.input_tokens` and `response.usage.output_tokens` from each call. With EzAI's transparent pricing, you can predict monthly spend accurately.
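The caching piece is only a few lines in practice. A minimal in-memory sketch, assuming a single long-running process (the `analyze_fn` parameter is a hypothetical stand-in for `analyze_pod` so the snippet doesn't depend on the API client):

```python
import hashlib
import time

# key -> (insert time, cached AI response); in-memory, single process only
_cache: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 300  # 5 minutes

def cached_analysis(diagnostics: dict, error_type: str, analyze_fn) -> str:
    """Return a cached response for identical diagnostics, else call the AI."""
    # Hash the error type plus all diagnostic text into a stable key.
    key = hashlib.sha256(
        (error_type + "\n".join(sorted(diagnostics.values()))).encode()
    ).hexdigest()
    now = time.monotonic()
    if key in _cache and now - _cache[key][0] < TTL_SECONDS:
        return _cache[key][1]  # cache hit: no tokens burned
    result = analyze_fn(diagnostics, error_type)
    _cache[key] = (now, result)
    return result
```

A pod restarting every 30 seconds in CrashLoopBackOff now costs one API call per 5-minute window instead of ten.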
Switching Models for Different Scenarios
One of the advantages of routing through EzAI is easy model switching. You can adjust the model based on severity:
```python
def pick_model(error_type: str, pod_count: int) -> str:
    """Use cheaper models for simple errors, heavy models for complex ones."""
    if error_type in ("ImagePullBackOff", "Pending"):
        return "claude-haiku-3-5"  # fast + cheap for obvious fixes
    if pod_count > 5:
        return "claude-haiku-3-5"  # keep costs down for batch scans
    return "claude-sonnet-4-5"  # deeper analysis for tricky errors
```
ImagePullBackOff is almost always a typo in the image tag or missing registry credentials. Haiku can diagnose that for $0.0003 per call — a tenth of a cent. Reserve Sonnet for the cases where the AI actually needs to reason through container interactions and timing issues.
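A quick standalone check of the routing (function repeated inline so the snippet runs on its own) shows each branch firing:

```python
# Routing check: simple errors and batch scans go to Haiku,
# everything else gets Sonnet.
def pick_model(error_type: str, pod_count: int) -> str:
    if error_type in ("ImagePullBackOff", "Pending"):
        return "claude-haiku-3-5"  # obvious fixes
    if pod_count > 5:
        return "claude-haiku-3-5"  # batch scans
    return "claude-sonnet-4-5"     # tricky errors

print(pick_model("ImagePullBackOff", 1))   # → claude-haiku-3-5
print(pick_model("CrashLoopBackOff", 2))   # → claude-sonnet-4-5
print(pick_model("CrashLoopBackOff", 10))  # → claude-haiku-3-5
```

To wire it into the CLI, pass the result as the `model` argument to `client.messages.create` instead of the hardcoded string (which means threading the pod count through to `analyze_pod`).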
What's Next
You've got a working K8s troubleshooter that turns 20-minute debugging sessions into 3-second diagnoses. Some directions to take it from here:
- Add Helm release analysis — pull `helm history` and diff the last two releases when a deployment fails after an upgrade
- Build a watch mode that monitors pods continuously and alerts when new failures appear, complete with pre-computed diagnosis
- Integrate with the AI incident responder for full automated incident management
- Check out our API docs to explore extended thinking for complex multi-service debugging
The full source code for this project is about 150 lines of Python. No frameworks, no abstractions — just kubectl, the Anthropic SDK, and EzAI handling the model routing. Start with a single namespace, prove it works for your team, then expand.