AI API Timeouts and Deadlines: Ship Reliable Apps

Your AI-powered feature works flawlessly in development. Then it hits production: a Claude Opus request takes 45 seconds, the user's browser gives up at 30, and your load balancer kills the connection at 60. Three different timeout boundaries, zero coordination between them. The user sees a spinner, then nothing. Your logs show a successful completion — for a response nobody received.

AI API calls are fundamentally different from typical REST endpoints. A database query takes 5-50ms. An AI model generating 4,000 tokens can take 10-90 seconds depending on the model, prompt complexity, and current load. Standard timeout defaults designed for traditional APIs will silently eat your AI requests. Here's how to fix that with a systematic approach using EzAI API.

Why AI APIs Are Timeout Minefields

Traditional APIs have predictable latency profiles. You call a REST endpoint, it hits a database, and responds in under 100ms. AI APIs break every assumption baked into standard HTTP infrastructure:

Variable latency by model — Claude Haiku responds in 1-3 seconds while Claude Opus might take 30-60 seconds for the same prompt
Token count determines duration — generating 100 tokens takes 2 seconds, generating 8,000 tokens takes 40+ seconds
Load-dependent spikes — during peak hours, p99 latency can 3-5x versus p50
Streaming changes the game — first byte arrives in 500ms but the full response takes 30 seconds, and your timeout needs to account for both

The default timeout: 30000 in most HTTP clients was designed for a world where 30 seconds meant something was broken. In AI, 30 seconds means the model is halfway through writing your code review.

The 4-Layer Timeout Strategy

Production AI applications need four distinct timeout boundaries. Each layer catches failures that the previous one misses.

Four timeout layers: connection → first byte → streaming → end-to-end deadline

Layer 1: Connection Timeout (5 seconds)

This catches DNS failures, network partitions, and unreachable hosts. If you can't establish a TCP connection in 5 seconds, retrying the same endpoint won't help — switch to a fallback.

python

import httpx

# Layer 1: Connection-level timeout
client = httpx.AsyncClient(
    base_url="https://ezaiapi.com",
    timeout=httpx.Timeout(
        connect=5.0,      # TCP + TLS handshake
        read=120.0,       # Between chunks (streaming)
        write=10.0,       # Sending request body
        pool=5.0,         # Waiting for connection from pool
    )
)

Layer 2: First-Byte Timeout (30 seconds)

After connecting, how long should you wait for the model to start responding? This catches cases where the API accepted your request but the model is overloaded and queued. For streaming requests, first byte means the first SSE event:

javascript

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: 'sk-your-key',
  baseURL: 'https://ezaiapi.com',
  timeout: 30_000,  // 30s first-byte timeout
});

// Streaming with first-byte awareness
const stream = await client.messages.stream({
  model: 'claude-sonnet-4-5',
  max_tokens: 4096,
  messages: [{ role: 'user', content: prompt }],
});

let firstChunkReceived = false;
const firstByteTimer = setTimeout(() => {
  if (!firstChunkReceived) {
    stream.abort();
    throw new Error('First byte timeout exceeded');
  }
}, 30_000);

for await (const event of stream) {
  if (!firstChunkReceived) {
    firstChunkReceived = true;
    clearTimeout(firstByteTimer);
  }
  processChunk(event);
}

Layer 3: Streaming Timeout (120 seconds)

Once tokens start flowing, you need a per-chunk idle timeout. If no new data arrives for 15 seconds mid-stream, the connection is likely dead — even though the TCP socket is still open. This is the timeout most developers miss.

Layer 4: End-to-End Deadline (180 seconds)

The total budget for the entire operation. This is your backstop — no matter what happens at the other three layers, the request dies at 180 seconds. Use AbortController in JavaScript or asyncio.timeout in Python:

python

import asyncio
import anthropic

client = anthropic.AsyncAnthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com",
)

async def call_with_deadline(prompt: str, deadline_seconds: int = 180):
    """End-to-end deadline wrapping the entire AI call."""
    try:
        async with asyncio.timeout(deadline_seconds):
            response = await client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=4096,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.content[0].text
    except asyncio.TimeoutError:
        # Log, increment metric, return fallback
        print(f"Deadline exceeded after {deadline_seconds}s")
        return None

Circuit Breakers for AI Endpoints

When an AI provider goes down, you don't want every request in your queue to wait the full timeout before failing. A circuit breaker tracks failures and short-circuits requests when the error rate crosses a threshold. Here's a minimal implementation that works with EzAI's API:

python

import time
from dataclasses import dataclass, field

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    cooldown_seconds: float = 30.0
    failures: int = field(default=0, init=False)
    last_failure: float = field(default=0.0, init=False)
    state: str = field(default="closed", init=False)

    def can_proceed(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open":
            # Check if cooldown has passed → half-open
            if time.time() - self.last_failure > self.cooldown_seconds:
                self.state = "half-open"
                return True
            return False
        return True  # half-open: allow one probe

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()
        if self.failures >= self.failure_threshold:
            self.state = "open"

# Usage with EzAI API
breaker = CircuitBreaker(failure_threshold=5, cooldown_seconds=30)

async def safe_ai_call(prompt):
    if not breaker.can_proceed():
        raise Exception("Circuit open — AI endpoint unavailable")
    try:
        result = await call_with_deadline(prompt)
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        raise

This pairs naturally with multi-model fallback — when the circuit opens on one model, route to a fallback model through EzAI's unified endpoint.

Deadline Propagation Across Services

In microservice architectures, the user's request passes through an API gateway, a backend service, and then the AI API call. Each layer adds latency overhead. If your user-facing timeout is 60 seconds and your gateway adds 2 seconds of overhead, the AI call's deadline should be 58 seconds — not 60.

Pass deadlines explicitly through your service chain instead of relying on independent timeouts at each layer:

javascript

// Express middleware — propagate deadline to AI calls
function deadlineMiddleware(userTimeoutMs = 60_000) {
  return (req, res, next) => {
    const deadline = Date.now() + userTimeoutMs;
    req.deadline = deadline;

    // Auto-abort if deadline passes
    req.deadlineTimer = setTimeout(() => {
      if (!res.headersSent) {
        res.status(504).json({ error: 'Request deadline exceeded' });
      }
    }, userTimeoutMs);

    res.on('finish', () => clearTimeout(req.deadlineTimer));
    next();
  };
}

// In your route handler
app.post('/api/chat', deadlineMiddleware(60_000), async (req, res) => {
  const remainingMs = req.deadline - Date.now() - 2000; // 2s buffer
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), remainingMs);

  const response = await client.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 2048,
    messages: req.body.messages,
  }, { signal: controller.signal });

  clearTimeout(timer);
  res.json(response);
});

Model-Specific Timeout Profiles

Different models have wildly different latency profiles. Applying the same timeout to Haiku and Opus is like using the same speed limit for a highway and a school zone. Here are practical timeout values based on real-world p99 latency data:

Claude Haiku / GPT-4o Mini — Connection: 5s, First byte: 10s, Total: 30s
Claude Sonnet / GPT-4o — Connection: 5s, First byte: 20s, Total: 90s
Claude Opus / GPT-4.5 — Connection: 5s, First byte: 45s, Total: 180s
Extended Thinking (any model) — Connection: 5s, First byte: 90s, Total: 300s

With EzAI's unified API, you can switch models without changing your infrastructure — just adjust the timeout profile when you change the model parameter.

Graceful Degradation: When Timeouts Fire

A timeout isn't an error — it's a signal to degrade gracefully. Instead of showing users a blank error page, build a fallback chain:

Try the target model (Opus/Sonnet) with full timeout
On timeout, fall back to a faster model (Haiku) with shorter max_tokens
On second timeout, return cached/static response if available
Last resort: acknowledge the delay and offer to notify the user when ready

The goal isn't perfection — it's never leaving the user staring at a spinner with no feedback. Even a "this is taking longer than usual, trying a faster model..." message is better than silence. Check out our guide on retry strategies for the full picture on building resilient AI request pipelines.

Key Takeaways

Set 4 separate timeout layers: connection, first byte, streaming idle, and end-to-end deadline
Use model-specific timeout profiles — Haiku and Opus need different numbers
Add a circuit breaker to fail fast when a provider is down
Propagate deadlines through your service chain with explicit budget subtraction
Always degrade gracefully — fall back to faster models before showing errors

Timeouts are the difference between "AI feature is flaky" and "AI feature is rock-solid." Get them right once, and your users will never know how much chaos is happening behind the scenes. Get started with EzAI and build timeout-resilient AI apps from day one.

AI API Timeouts and Deadlines: Ship Reliable Apps

Why AI APIs Are Timeout Minefields

The 4-Layer Timeout Strategy

Layer 1: Connection Timeout (5 seconds)

Layer 2: First-Byte Timeout (30 seconds)

Layer 3: Streaming Timeout (120 seconds)

Layer 4: End-to-End Deadline (180 seconds)

Circuit Breakers for AI Endpoints

Deadline Propagation Across Services

Model-Specific Timeout Profiles

Graceful Degradation: When Timeouts Fire

Key Takeaways

Related Posts

AI API Retry Strategies That Actually Work

Multi-Model Fallback: Never Go Down