AI Agents in Production: Architecture That Scales

Everyone is building AI agents right now. Most of them break in production. The demo works on stage — the agent books a flight, writes code, queries a database — but deploy it to real users and you get infinite loops, hallucinated tool calls, and $400 API bills from a single runaway session.

This guide covers the architecture patterns that separate toy agents from production systems. We'll build a real agent framework in Python using the Claude API through EzAI, with patterns you can steal for your own stack.

The Agent Loop: Core Architecture

Every AI agent follows the same fundamental loop: receive input, think, act, observe, repeat. The difference between a fragile demo and a production system is what happens between those steps — retry logic, token budgets, tool validation, and graceful degradation.

Here's the skeleton that every production agent needs:

python

import anthropic
import json, time

client = anthropic.Anthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com"
)

class Agent:
    def __init__(self, tools, system_prompt, max_turns=15, max_tokens=100_000):
        self.tools = tools
        self.system = system_prompt
        self.max_turns = max_turns
        self.max_tokens = max_tokens
        self.messages = []
        self.total_tokens = 0

    def run(self, user_input: str) -> str:
        self.messages.append({"role": "user", "content": user_input})

        for turn in range(self.max_turns):
            response = client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=4096,
                system=self.system,
                tools=self.tools,
                messages=self.messages
            )

            self.total_tokens += response.usage.input_tokens + response.usage.output_tokens
            if self.total_tokens > self.max_tokens:
                return "[Token budget exceeded — stopping agent]"

            self.messages.append({"role": "assistant", "content": response.content})

            if response.stop_reason == "end_turn":
                return self._extract_text(response)

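            if response.stop_reason == "tool_use":
                results = self._execute_tools(response)
                self.messages.append({"role": "user", "content": results})
                continue

            # Any other stop reason (for example "max_tokens"): return what we
            # have instead of calling the API again after a dangling assistant turn.
            return self._extract_text(response)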

        return "[Max turns reached — agent stopped]"

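Two guardrails are doing the heavy lifting: max_turns prevents infinite loops, and the agent's max_tokens budget (a running total across the session, distinct from the per-call max_tokens=4096 passed to the API) caps your spend. Without these, a confused agent will happily burn through your entire balance calling the same tool in a loop.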
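The loop above calls two helpers, _extract_text and _execute_tools, without showing them. Here is a minimal sketch of what they might look like, written as standalone functions so they can be exercised without an API key. The run_tool callback and the error formatting are assumptions for illustration, not part of the original framework:

```python
from types import SimpleNamespace  # only needed to fake a response below

def extract_text(response) -> str:
    """Concatenate the text blocks of a Messages API response."""
    return "".join(
        block.text for block in response.content
        if getattr(block, "type", None) == "text"
    )

def execute_tools(response, run_tool) -> list:
    """Run every tool_use block and return tool_result blocks for the next user turn.

    run_tool is a hypothetical callback: run_tool(name, inputs) -> str.
    """
    results = []
    for block in response.content:
        if getattr(block, "type", None) != "tool_use":
            continue
        try:
            output, is_error = run_tool(block.name, block.input), False
        except Exception as exc:  # report the failure to the model instead of crashing
            output, is_error = f"{type(exc).__name__}: {exc}", True
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,  # must match the id of the tool_use block
            "content": output,
            "is_error": is_error,
        })
    return results

# Fake response with one text block and one tool call, for illustration only
fake = SimpleNamespace(content=[
    SimpleNamespace(type="text", text="Checking the database."),
    SimpleNamespace(type="tool_use", id="toolu_1", name="echo", input={"x": 1}),
])
tool_results = execute_tools(fake, lambda name, inputs: f"{name} got {inputs}")
```

Inside the Agent class these would become methods, with run_tool bound to the execute_tool function from the next section.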

Tool Routing: Let the Agent Do Real Work

[Figure: Agent tool routing: validate inputs, execute safely, return structured results]

Tools are what separate a chatbot from an agent. Claude's tool use API lets you define functions the model can call, but production code needs a validation layer between the model's intent and actual execution. Never trust raw model output to hit your database.

python

# Define tools with strict input schemas
TOOLS = [
    {
        "name": "query_database",
        "description": "Run a read-only SQL query against the analytics DB.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "SELECT query only"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "send_slack_message",
        "description": "Post a message to a Slack channel.",
        "input_schema": {
            "type": "object",
            "properties": {
                "channel": {"type": "string"},
                "text": {"type": "string"}
            },
            "required": ["channel", "text"]
        }
    }
]

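# Validation + execution layer
# db, slack, and ALLOWED_CHANNELS are assumed to be defined elsewhere in your app.
import re

# Whole-word match, so a column such as "updated_at" is not rejected by mistake
BLOCKED_SQL = re.compile(r"\b(DROP|DELETE|UPDATE|INSERT|ALTER|TRUNCATE)\b", re.IGNORECASE)

def execute_tool(name: str, inputs: dict) -> str:
    if name == "query_database":
        query = inputs["query"].strip()
        # The tool is documented as SELECT-only: enforce that, then check the blocklist
        if not query.upper().startswith("SELECT") or BLOCKED_SQL.search(query):
            return json.dumps({"error": "Only read-only SELECT queries are allowed"})
        rows = db.execute_readonly(query)
        # Cap result size; default=str keeps dates and decimals serializable
        return json.dumps(rows[:50], default=str)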

    if name == "send_slack_message":
        if inputs["channel"] not in ALLOWED_CHANNELS:
            return json.dumps({"error": "Channel not in allowlist"})
        slack.post(inputs["channel"], inputs["text"])
        return json.dumps({"ok": True})

    return json.dumps({"error": f"Unknown tool: {name}"})

Key patterns here: SQL queries are validated against a blocklist before touching the database, result sets are capped at 50 rows to prevent token explosion on the next turn, and Slack channels are restricted to an allowlist. The agent gets useful error messages it can reason about, not silent failures. Treat the keyword blocklist as defense in depth rather than the only barrier: the database connection itself should use a read-only role.
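The same idea applies to the shape of the inputs: the model can omit a required field or send the wrong type, and inputs["query"] would then raise instead of returning an error the agent can read. Below is a rough sketch of a pre-flight check against a tool's input_schema. The validate_inputs helper is hypothetical, and it only covers required fields and a few basic types; it is not a full JSON Schema validator:

```python
def validate_inputs(schema: dict, inputs: dict) -> list[str]:
    """Return a list of problems; an empty list means the inputs look usable."""
    # Only a handful of JSON Schema types are handled in this sketch
    type_map = {"string": str, "boolean": bool, "array": list, "object": dict}
    properties = schema.get("properties", {})
    problems = []

    for key in schema.get("required", []):
        if key not in inputs:
            problems.append(f"missing required field: {key}")

    for key, value in inputs.items():
        if key not in properties:
            problems.append(f"unexpected field: {key}")
            continue
        declared = properties[key].get("type")
        expected = type_map.get(declared)
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{key} should be {declared}")

    return problems

# Schema copied from the query_database tool definition above
QUERY_SCHEMA = {
    "type": "object",
    "properties": {"query": {"type": "string", "description": "SELECT query only"}},
    "required": ["query"],
}
```

If the list comes back non-empty, return it to the model as a JSON error, the same way execute_tool reports a blocked query.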

Memory and Context Management

Agents that run for multiple turns burn through context windows fast. A 10-turn agent conversation with tool results can easily hit 50k tokens. At $3/M input tokens for Sonnet, that's 15 cents per conversation — and it gets worse as the conversation grows because you're re-sending the entire history every turn.
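To see why re-sending the history matters, here is a small worked calculation. It assumes, purely for illustration, that every turn adds the same 5,000 tokens to the history; real sessions are lumpier:

```python
def cumulative_input_tokens(tokens_added_per_turn: int, turns: int) -> int:
    """Total input tokens billed when each turn re-sends the whole history so far."""
    # Turn t sends t * tokens_added_per_turn tokens, so the total grows quadratically
    return sum(tokens_added_per_turn * t for t in range(1, turns + 1))

final_context = 5_000 * 10                   # size of the history at turn 10
billed = cumulative_input_tokens(5_000, 10)  # input tokens paid for across all 10 turns
cost_usd = billed * 3 / 1_000_000            # at the $3/M Sonnet input rate quoted above

print(final_context, billed, round(cost_usd, 3))  # 50000 275000 0.825
```

In this hypothetical session the history ends at 50k tokens, but 275k input tokens are billed along the way, which is why trimming the history early pays off.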

The fix: summarize and compress after every N turns.

python

def compress_history(messages: list, keep_recent: int = 4) -> list:
    """Summarize old messages, keep recent ones intact."""
    if len(messages) <= keep_recent:
        return messages

    old = messages[:-keep_recent]
    recent = messages[-keep_recent:]

    summary = client.messages.create(
        model="claude-haiku-3-5",  # Cheap model for summarization
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize this agent conversation so far. "
                     f"Include: tools called, results obtained, decisions made.\n\n"
                     f"{json.dumps(old, default=str)}"
        }]
    )

    compressed = [{
        "role": "user",
        "content": f"[Previous context summary: {summary.content[0].text}]"
    }]
    return compressed + recent

This pattern uses Haiku (at $0.25/M input tokens via EzAI) to compress older conversation turns into a summary, then keeps only the last 4 messages intact. You reduce context size by 60-80% while preserving the information the agent needs for its next decision.
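One caveat with slicing by message count: if the cut falls between an assistant message containing a tool_use block and the user message carrying its tool_result, the kept tail starts with an orphaned tool_result, and the Messages API rejects that. Here is a possible guard, assuming tool results are stored as plain dicts as in the agent loop above (has_tool_result and safe_split are hypothetical helper names):

```python
def has_tool_result(message: dict) -> bool:
    """True if the message content includes at least one tool_result block."""
    content = message.get("content")
    if not isinstance(content, list):
        return False
    return any(isinstance(b, dict) and b.get("type") == "tool_result" for b in content)

def safe_split(messages: list, keep_recent: int = 4) -> tuple[list, list]:
    """Split into (old, recent) without separating a tool_result from its tool_use."""
    cut = max(len(messages) - keep_recent, 0)
    # Move the cut earlier until the kept tail no longer starts with a tool_result
    while cut > 0 and has_tool_result(messages[cut]):
        cut -= 1
    return messages[:cut], messages[cut:]

# Illustrative history: plain dicts stand in for SDK content blocks
history = [
    {"role": "user", "content": "How many signups last week?"},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "t1"}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t1"}]},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "t2"}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t2"}]},
    {"role": "assistant", "content": "There were 412 signups."},
]
old, recent = safe_split(history, keep_recent=4)
```

Here a plain slice would start the tail at the first tool_result, so the cut moves back one message and the tail begins with the assistant message that issued the call.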

Retry Logic and Error Recovery

Production agents hit errors constantly — rate limits, network timeouts, malformed tool calls, and the occasional model hallucination. Your agent loop needs to handle all of these without crashing or entering an infinite retry cycle.

python

import anthropic
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type((
        anthropic.RateLimitError,
        anthropic.APIConnectionError,
        anthropic.InternalServerError
    ))
)
def call_api(messages, tools, system):
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4096,
        system=system,
        tools=tools,
        messages=messages
    )

# Detect loops: same tool called 3+ times with identical args
def detect_loop(messages, window=6) -> bool:
    tool_calls = []
    for msg in messages[-window:]:
        if msg["role"] == "assistant":
            for block in msg.get("content", []):
                if hasattr(block, "type") and block.type == "tool_use":
                    tool_calls.append(f"{block.name}:{json.dumps(block.input, sort_keys=True)}")
    return len(tool_calls) >= 3 and len(set(tool_calls)) == 1

The detect_loop function catches one of the most common production failures: the agent calling the same tool with the same arguments repeatedly. When detected, add a short instruction to the next user turn (the Messages API has no system role inside the messages list) telling the agent to try a different approach or report its findings. Without this, you'll see agents call query_database with an identical broken query 50 times in a row.
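One way to act on a detected loop is to append a text block to the same user turn that carries the tool results. The sketch below assumes tool results are a list of dicts; LOOP_NUDGE and with_loop_nudge are hypothetical names. The text block goes after the tool_result blocks, since tool results are expected to come first in the content array:

```python
LOOP_NUDGE = (
    "You have called the same tool with identical arguments several times. "
    "Try a different approach, or stop and report what you have found so far."
)

def with_loop_nudge(tool_results: list, looping: bool) -> list:
    """Return the content for the next user turn, adding a nudge if the agent is looping."""
    if not looping:
        return tool_results
    # Build a new list so the caller's tool_results are left untouched
    return tool_results + [{"type": "text", "text": LOOP_NUDGE}]

# Illustrative tool result from a repeated failing query
results = [{"type": "tool_result", "tool_use_id": "t1", "content": '{"error": "syntax error"}'}]
next_turn = {"role": "user", "content": with_loop_nudge(results, looping=True)}
```

In the agent loop, looping would come from detect_loop(self.messages) just before the tool results are appended.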

Multi-Model Routing for Cost Control

[Figure: Per-turn cost by model: route planning to Sonnet, simple extraction to Haiku]

Not every agent turn needs the same model. Planning and complex reasoning calls should use Sonnet or Opus, but simple data extraction or formatting can use Haiku at 1/12th the cost. Route by task complexity:

python

MODEL_TIERS = {
    "planning": "claude-sonnet-4-5",     # Complex reasoning
    "execution": "claude-sonnet-4-5",    # Tool use + decisions
    "extraction": "claude-haiku-3-5",    # Parse results, format output
    "summarization": "claude-haiku-3-5", # Compress context
}

def pick_model(turn: int, last_stop_reason: str) -> str:
    if turn == 0:
        return MODEL_TIERS["planning"]    # First turn: plan the approach
    if last_stop_reason == "tool_use":
        return MODEL_TIERS["execution"]   # Mid-task: keep Sonnet
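    # Reached only when the previous turn stopped for a reason other than
    # tool_use (for example max_tokens), so Haiku finishes the output.
    return MODEL_TIERS["extraction"]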

With EzAI's unified API, switching between models is just a string change — no separate clients, no different authentication. A 10-turn agent that routes Haiku for 4 of those turns saves roughly 30-40% on total token cost compared to running Sonnet for every turn.

Observability: Logging Every Turn

When an agent does something wrong in production, you need to know exactly which turn went sideways. Log every API call with its inputs, outputs, token counts, and timing. This is non-negotiable — you can't debug agent behavior without turn-level traces.

python

import json, logging, time

logger = logging.getLogger("agent")

def traced_call(messages, tools, system, turn, session_id):
    start = time.monotonic()
    response = call_api(messages, tools, system)
    elapsed = time.monotonic() - start

    logger.info(json.dumps({
        "session_id": session_id,
        "turn": turn,
        "model": response.model,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "stop_reason": response.stop_reason,
        "latency_ms": round(elapsed * 1000),
        "tool_calls": [
            b.name for b in response.content
            if hasattr(b, "type") and b.type == "tool_use"
        ]
    }))

    return response

Ship these logs to whatever you already use — Datadog, Grafana, CloudWatch, even a Postgres table. The critical fields are session_id (to trace a full agent run), turn (to see where things went wrong), and tool_calls (to see what the agent tried to do). You can monitor these through EzAI's dashboard too, which shows per-request token counts and costs in real time.
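Once the logs exist, the first question is usually "what did this session cost and what did it call?". Below is a rough sketch of rolling turn-level records up by session_id, assuming each log line is the JSON object emitted by traced_call (summarize_sessions is a hypothetical helper, not part of any logging library):

```python
import json
from collections import defaultdict

def summarize_sessions(log_lines):
    """Aggregate turn-level JSON log lines into per-session totals."""
    sessions = defaultdict(
        lambda: {"turns": 0, "input_tokens": 0, "output_tokens": 0, "tool_calls": []}
    )
    for line in log_lines:
        record = json.loads(line)
        summary = sessions[record["session_id"]]
        summary["turns"] += 1
        summary["input_tokens"] += record["input_tokens"]
        summary["output_tokens"] += record["output_tokens"]
        summary["tool_calls"].extend(record.get("tool_calls", []))
    return dict(sessions)

# Two illustrative log lines from the same session
sample_logs = [
    json.dumps({"session_id": "s1", "turn": 0, "input_tokens": 1200,
                "output_tokens": 300, "tool_calls": ["query_database"]}),
    json.dumps({"session_id": "s1", "turn": 1, "input_tokens": 2100,
                "output_tokens": 150, "tool_calls": []}),
]
summary = summarize_sessions(sample_logs)
```

A session whose tool_calls list repeats the same name many times is a good first place to look for a stuck agent.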

Putting It All Together

Here's how these patterns combine into a production-ready agent runner:

Token budget — Set a hard ceiling per session. Kill the loop when exceeded.
Turn limit — Cap at 15-20 turns. Agents that need more are usually stuck.
Tool validation — Allowlists, blocklists, and result size caps on every tool.
Loop detection — Catch repeated identical tool calls and break the cycle.
Context compression — Summarize old turns with a cheap model to control costs.
Model routing — Use Sonnet for reasoning, Haiku for formatting. Save 30-40%.
Turn-level logging — Every API call traced with tokens, timing, and tool calls.

The biggest mistake teams make is skipping the guardrails during development and bolting them on later. Build them into the core loop from day one. A runaway agent at 2 AM is an expensive and embarrassing lesson to learn from your billing dashboard.

Cost Breakdown: Real Numbers

Here's what a typical 10-turn agent session costs through EzAI with these optimizations applied:

Turns 1-2 (planning): Sonnet, ~4k input + 2k output = ~$0.024
Turns 3-8 (execution): Sonnet, ~8k input + 1.5k output avg = ~$0.063
Turns 9-10 (extraction): Haiku, ~3k input + 500 output = ~$0.002
1x context compression: Haiku, ~6k input + 500 output = ~$0.002
Total: ~$0.09 per session

Without model routing and compression, the same session runs about $0.16 — nearly double. Over 10,000 sessions/month, that's a $700 difference. Not trivial.
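To redo this estimate for your own traffic, the arithmetic can be wrapped in two small functions. The rates and model labels below are placeholders for illustration only, not EzAI's price list; substitute the per-million-token prices you are actually charged:

```python
def session_cost(calls, rates) -> float:
    """Sum the cost of one session in USD.

    calls: list of (model, input_tokens, output_tokens)
    rates: {model: (usd_per_million_input, usd_per_million_output)}
    """
    total = 0.0
    for model, tokens_in, tokens_out in calls:
        rate_in, rate_out = rates[model]
        total += (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000
    return total

def monthly_difference(cost_a: float, cost_b: float, sessions: int) -> float:
    """Monthly gap between two per-session costs."""
    return (cost_a - cost_b) * sessions

# Placeholder rates and token counts, chosen so the arithmetic is easy to follow
PLACEHOLDER_RATES = {"big": (2.0, 10.0), "small": (0.5, 1.0)}
example_calls = [("big", 1_000_000, 100_000), ("small", 2_000_000, 500_000)]

example_total = session_cost(example_calls, PLACEHOLDER_RATES)
gap = monthly_difference(0.16, 0.09, 10_000)  # the per-session figures quoted above
```

With the per-session figures quoted above ($0.16 versus $0.09), monthly_difference reproduces the roughly $700 gap over 10,000 sessions.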

Start building at ezaiapi.com/dashboard. All Claude, GPT, and Gemini models are available through the same endpoint, and you can switch between them with a single parameter change. Check out our guides on tool calling and multi-model fallback for more production patterns.

Related Posts

AI Tool Use & Function Calling via API

Multi-Model Fallback: Keep Your AI App Running 24/7