AI API Backpressure: Stop Overload From Killing Your Service

Your AI gateway hits 1,200 RPS during a launch. The upstream model can sustain 400. Without backpressure, the queue grows unbounded, p99 latency rockets to 90 seconds, healthchecks start failing, the load balancer panics, and you page on-call at 3 AM. With backpressure, excess requests get an immediate 429 Too Many Requests, admitted requests keep their normal latency, and your dashboard stays green.

This post is about the boring plumbing that keeps AI APIs alive under load: rejecting requests early, bounding queues, and shaping traffic with token buckets. We'll use real code against ezaiapi.com, but every pattern works against any upstream provider.

Why AI APIs need backpressure more than most

Three things make LLM workloads brutal compared to a regular REST service:

Per-request cost is huge. A single completion can take 30 seconds and consume 8K tokens. One stuck request occupies a worker for the duration of a small batch job.
Latency variance is enormous. A 200-token prompt with a 100-token answer might take 800ms. The same prompt with a long answer can take 45 seconds. Capacity planning by averages lies to you.
Upstream rate limits are strict and tiered. Anthropic, OpenAI, and Google all enforce per-minute token and request limits. Cross them and the failures cascade to your own users.

If you don't push back at the edge, the queue absorbs all the pain — until it doesn't. You want to fail at the door, not in the middle of a 30-second stream.
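To see how fast an unbounded queue gets you into trouble, here is a back-of-the-envelope sketch using the launch numbers from the intro (1,200 RPS arriving, 400 RPS sustainable). It assumes constant rates and a FIFO queue, which real traffic won't give you, so treat the output as an order of magnitude rather than a prediction. The function names are made up for this illustration.

```python
# Numbers from the launch scenario: 1,200 RPS arriving, 400 RPS sustainable.
ARRIVAL_RPS = 1200
SERVICE_RPS = 400


def backlog_after(seconds: float) -> float:
    """Requests waiting in an unbounded queue after `seconds` of overload."""
    return max(0.0, (ARRIVAL_RPS - SERVICE_RPS) * seconds)


def wait_for_new_arrival(seconds: float) -> float:
    """Approximate queueing delay for a request arriving at `seconds`."""
    return backlog_after(seconds) / SERVICE_RPS


def seconds_until_wait(target_wait: float) -> float:
    """How long the overload must last before new arrivals wait `target_wait`."""
    return target_wait * SERVICE_RPS / (ARRIVAL_RPS - SERVICE_RPS)


if __name__ == "__main__":
    print(backlog_after(45))         # backlog after 45 s of overload
    print(wait_for_new_arrival(45))  # queueing delay at that point
    print(seconds_until_wait(90))    # time to reach a 90 s wait
```

Under these assumptions the backlog grows by 800 requests every second, so a 90-second queueing delay shows up after only 45 seconds of overload. That is the window you have before the queue stops absorbing pain and starts causing it.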

The three patterns you actually need

[Figure: Three backpressure strategies (reject, buffer, throttle). Pick one per layer, stack them at the edges.]

The classic Reactive Streams playbook gives you four primitives: buffer, drop, throttle, and signal back. For AI gateways, the working set boils down to:

Reject early with 429 and a Retry-After header when concurrency exceeds your capacity.
Bounded queue in the worker so spikes get smoothed but the queue can't grow forever.
Token bucket per API key so noisy neighbors can't starve everyone else.

Stack them. Throttle at the edge, buffer in the worker, reject when the buffer is full. Now you have a system that degrades predictably instead of falling off a cliff.

Pattern 1: Reject early with 429 + Retry-After

The cheapest backpressure is the one that runs before you allocate anything. Track in-flight requests with a counter (a plain integer is enough in single-threaded Node.js; use an atomic in a multithreaded runtime). If it crosses a threshold, reject immediately. The caller sees a real HTTP status, not a 60-second timeout.

javascript

// gateway.js — concurrency limiter in front of EzAI
import express from "express";
import Anthropic from "@anthropic-ai/sdk";

const MAX_INFLIGHT = 200;
let inflight = 0;

const client = new Anthropic({
  apiKey: process.env.EZAI_KEY,
  baseURL: "https://ezaiapi.com",
});

const app = express();
app.use(express.json());

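app.post("/chat", async (req, res) => {
  // Shed load before doing any work: no allocation, no upstream call.
  if (inflight >= MAX_INFLIGHT) {
    res.setHeader("Retry-After", "2");
    return res.status(429).json({ error: "overloaded" });
  }
  inflight++;
  try {
    const msg = await client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 1024,
      messages: req.body.messages,
    });
    res.json(msg);
  } catch (err) {
    // In Express 4 a rejected promise in an async handler is not caught
    // for you, so answer explicitly instead of leaving the request hanging.
    res.status(502).json({ error: "upstream_error" });
  } finally {
    // Always release the slot, on success and on failure.
    inflight--;
  }
});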

app.listen(8080);

Two things make this work in practice: pick MAX_INFLIGHT based on your actual upstream concurrency budget (usually 2–4× the number of requests in flight at your steady-state throughput target), and always set Retry-After. Good clients respect it; bad clients hammer you regardless and you can rate-limit them at the LB.
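One way to turn that sizing advice into a number is Little's Law: requests in flight ≈ throughput × mean latency. The sketch below applies it and then multiplies by a headroom factor. Reading the 2–4× rule of thumb as headroom over the steady-state in-flight count is an assumption of this sketch, and the 40 RPS and 2.5 s figures are hypothetical, not measurements.

```python
import math


def steady_state_inflight(throughput_rps: float, mean_latency_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate * time in system."""
    return throughput_rps * mean_latency_s


def max_inflight(throughput_rps: float, mean_latency_s: float,
                 headroom: float = 2.0, upstream_cap=None) -> int:
    """Concurrency limit with headroom, optionally clamped to an upstream cap."""
    limit = math.ceil(steady_state_inflight(throughput_rps, mean_latency_s) * headroom)
    if upstream_cap is not None:
        # Never admit more than the provider will let you run concurrently.
        limit = min(limit, upstream_cap)
    return limit


if __name__ == "__main__":
    # Hypothetical target: 40 RPS at 2.5 s mean latency, 2x headroom.
    print(max_inflight(40, 2.5))
```

With those hypothetical inputs you land on 200, the same value as MAX_INFLIGHT in the gateway above. Because LLM latency has such a long tail, re-run the numbers with your p95 latency too and see whether the limit still makes sense.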

Pattern 2: Bounded queue with timeout

Pure rejection is harsh on bursty traffic. A small bounded queue absorbs micro-spikes without letting load build up. The trick is the bound — and a per-item deadline so requests that have been waiting too long get rejected before they're even started.

python

# worker.py — bounded queue + per-item deadline
import asyncio
import os
import time

import anthropic

QUEUE_MAX = 50
WORKERS = 8
ITEM_TTL = 5.0  # reject if waiting > 5s

q: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAX)
client = anthropic.AsyncAnthropic(
    api_key=os.environ["EZAI_KEY"],
    base_url="https://ezaiapi.com",
)

async def enqueue(payload):
    item = {"payload": payload, "enq": time.monotonic(),
            "fut": asyncio.Future()}
    try:
        q.put_nowait(item)
    except asyncio.QueueFull:
        raise RuntimeError("overloaded")
    return await item["fut"]

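async def worker():
    while True:
        item = await q.get()
        fut = item["fut"]
        if fut.done():
            # The caller already gave up (e.g. was cancelled); skip the call.
            continue
        if time.monotonic() - item["enq"] > ITEM_TTL:
            # Waited too long: fail fast instead of spending tokens on it.
            fut.set_exception(TimeoutError("queue stale"))
            continue
        try:
            msg = await client.messages.create(**item["payload"])
            if not fut.done():
                fut.set_result(msg)
        except Exception as e:
            if not fut.done():
                fut.set_exception(e)

async def main():
    # Keep references to the tasks and wait on them so the workers stay alive.
    workers = [asyncio.create_task(worker()) for _ in range(WORKERS)]
    await asyncio.gather(*workers)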

The TTL check is what separates a queue that helps from a queue that hides problems. Without it, a 30-second backlog still gets processed even though the user has long since given up — wasting tokens, money, and worker time on responses nobody will read. Pair this with request deadlines end-to-end.
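Here is a minimal sketch of what an end-to-end deadline can look like: carry one absolute deadline with the request and, at every hop, spend only what is left of it. The names `DeadlineExceeded`, `call_with_deadline`, and `fake_upstream` are hypothetical and exist only for this example; `fake_upstream` stands in for the real model call.

```python
import asyncio
import time


class DeadlineExceeded(Exception):
    """Raised when a request has no time budget left."""


def remaining(deadline: float) -> float:
    """Seconds left until an absolute time.monotonic() deadline."""
    return deadline - time.monotonic()


async def call_with_deadline(make_call, deadline: float):
    """Run make_call() but never past the request's absolute deadline."""
    budget = remaining(deadline)
    if budget <= 0:
        # Already too late: do not start work nobody will read.
        raise DeadlineExceeded("no budget left before starting")
    try:
        return await asyncio.wait_for(make_call(), timeout=budget)
    except asyncio.TimeoutError:
        raise DeadlineExceeded("upstream call ran past the deadline") from None


async def fake_upstream(delay: float) -> str:
    """Stand-in for the model call; sleeps instead of hitting the network."""
    await asyncio.sleep(delay)
    return "ok"


async def demo():
    fast = await call_with_deadline(lambda: fake_upstream(0.01),
                                    time.monotonic() + 0.5)
    try:
        await call_with_deadline(lambda: fake_upstream(0.5),
                                 time.monotonic() + 0.05)
        slow = "completed"
    except DeadlineExceeded:
        slow = "deadline"
    return fast, slow


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of using an absolute deadline rather than a fixed per-call timeout is that time spent waiting in the queue is subtracted automatically. A request that sat for four seconds gets a correspondingly shorter upstream budget.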

Pattern 3: Token bucket per API key

Concurrency limits protect the system. Token buckets protect tenants from each other. One client running a backfill at 800 RPS shouldn't starve the rest of your customers. Track tokens per key, refill at a steady rate, reject when empty.

python

import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate          # tokens/sec
        self.burst = burst        # bucket size
        self.tokens = burst
        self.ts = time.monotonic()

    def take(self, n=1):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.ts) * self.rate)
        self.ts = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

buckets = defaultdict(lambda: TokenBucket(rate=10, burst=30))

def allow(api_key, cost=1):
    return buckets[api_key].take(cost)

Two upgrades worth doing once you've shipped this:

Cost-weighted tokens. Take more tokens for expensive models or long max_tokens. A request to claude-opus-4 with 8K output costs you 30× a Haiku call — your bucket should reflect that.
Move state to Redis. The in-process version above is fine for one node. With multiple gateway replicas, use a Lua script on Redis to make the take-and-decrement atomic.
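A rough sketch of the cost-weighted upgrade follows. The weights and the 4,096-token unit are hypothetical placeholders, not real pricing; derive your own from what each model actually costs you. Matching on a substring of the model name is also just a convenience for the example.

```python
import math

# Hypothetical relative weights per model family; tune from your own costs.
MODEL_WEIGHT = {"haiku": 1.0, "sonnet": 3.0, "opus": 15.0}
DEFAULT_WEIGHT = 3.0      # assumption: unknown models are priced like mid-tier
TOKENS_PER_UNIT = 4096    # assumption: one bucket token covers 4K output tokens


def request_cost(model: str, max_tokens: int) -> int:
    """Bucket tokens to charge for one request, never less than 1."""
    weight = next(
        (w for name, w in MODEL_WEIGHT.items() if name in model),
        DEFAULT_WEIGHT,
    )
    return max(1, math.ceil(weight * max_tokens / TOKENS_PER_UNIT))


if __name__ == "__main__":
    print(request_cost("claude-opus-4", 8192))
    print(request_cost("claude-sonnet-4-5", 1024))
```

You would then call `allow(api_key, cost=request_cost(model, max_tokens))` instead of charging a flat 1. Keep the largest possible cost at or below the bucket's burst size, otherwise that request can never be admitted no matter how long the caller waits.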

Surfacing backpressure to clients

Backpressure only works if callers react. Three things make that easy:

Standard status codes. Use 429 for rate limit and 503 for "overloaded, try later". Don't invent 418 or wrap errors in 200s.
Honest Retry-After. A real number (Retry-After: 3) lets smart retry logic back off without guesswork.
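Expose limit headers. Use the widely recognized X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset convention so generic retry logic can read them. Upstream providers use their own names (Anthropic's are prefixed anthropic-ratelimit-), so translate them at the gateway rather than assuming an SDK will understand yours.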
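The three points above can be folded into one small helper that every rejection path calls. This is a sketch under assumptions: the reason names mirror the ones used for metrics later in this post, the mapping of reasons to 429 versus 503 is one reasonable choice rather than a rule, and X-RateLimit-Reset is expressed as seconds until reset (some APIs use an epoch timestamp instead).

```python
import json
import math

# Assumption: rejections caused by the caller's own rate map to 429,
# rejections caused by gateway overload map to 503.
RATE_LIMIT_REASONS = {"token_bucket", "concurrency"}


def retry_after_for_bucket(tokens: float, cost: float, rate: float) -> float:
    """Seconds until a token bucket holds enough tokens for `cost`."""
    return max(0.0, (cost - tokens) / rate)


def backpressure_response(reason: str, retry_after_s: float,
                          limit: int, remaining: int, reset_s: float):
    """Build (status, headers, body) for a rejected request."""
    status = 429 if reason in RATE_LIMIT_REASONS else 503
    headers = {
        # Retry-After takes whole seconds; round up so clients never retry early.
        "Retry-After": str(max(1, math.ceil(retry_after_s))),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(max(0, math.ceil(reset_s))),
        "Content-Type": "application/json",
    }
    body = json.dumps({"error": reason})
    return status, headers, body


if __name__ == "__main__":
    wait = retry_after_for_bucket(tokens=0.5, cost=1, rate=10)
    print(backpressure_response("token_bucket", wait, 30, 0, wait))
```

Computing Retry-After from the bucket state is what keeps it honest: the client is told when a retry can actually succeed, not a constant picked once and forgotten.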

What to monitor

You can't tune what you can't see. The four metrics that matter:

Queue depth (gauge). If it sits non-zero, your steady-state capacity is too low.
Reject rate (counter). By reason: 429_concurrency, 429_token_bucket, queue_full, queue_stale.
Wait time histogram. Time from enqueue to worker pickup. p99 should be under your TTL.
Inflight gauge. Should bounce around MAX_INFLIGHT * 0.7 — not pinned at the cap.
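To make the four metrics concrete, here is a standard-library-only recorder. In production you would emit these through whatever metrics client you already run; the class below is a hypothetical stand-in that only shows what to record and where the p99 comes from.

```python
from collections import Counter


class BackpressureMetrics:
    """In-memory stand-in for a real metrics client."""

    def __init__(self):
        self.rejects = Counter()   # reject counter, keyed by reason
        self.queue_depth = 0       # gauge: items currently waiting
        self.inflight = 0          # gauge: requests currently upstream
        self.wait_times = []       # samples: enqueue -> worker pickup

    def reject(self, reason: str) -> None:
        self.rejects[reason] += 1

    def observe_wait(self, seconds: float) -> None:
        self.wait_times.append(seconds)

    def wait_p99(self) -> float:
        """Nearest-rank 99th percentile of recorded wait times."""
        if not self.wait_times:
            return 0.0
        ordered = sorted(self.wait_times)
        # ceil(0.99 * n) - 1, done in integers to avoid float rounding.
        idx = (99 * len(ordered) + 99) // 100 - 1
        return ordered[idx]


if __name__ == "__main__":
    m = BackpressureMetrics()
    m.reject("queue_full")
    m.reject("queue_stale")
    m.reject("queue_full")
    for ms in range(1, 101):
        m.observe_wait(ms / 1000)
    print(m.rejects["queue_full"], m.wait_p99())
```

Keeping the reject reason as a label rather than a single counter is the part worth copying: a spike in queue_stale points at slow workers, while a spike in 429_token_bucket points at one noisy tenant.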

If you're already shipping OpenTelemetry traces, add a span attribute for backpressure decisions. When something goes sideways at 2 AM, you want to immediately see whether the gateway shed load or upstream hung.

Putting it together

The recipe that has worked across half a dozen production AI gateways:

Token bucket per API key at the edge — fairness.
Concurrency limit per gateway node — protection.
Bounded queue per worker pool with TTL — burst absorption.
Standard 429/503 with Retry-After — graceful degradation.
Metrics on queue depth, wait time, reject reasons — observability.
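Stacked together, the admission decision can be sketched as a single function: throttle per key first, run immediately if there is a free slot, buffer if there is not, and reject when the buffer is full. The `Gateway` class and its method names are hypothetical, the bucket is a compact copy of the one from Pattern 3, and this only makes the decision; actually running and draining work is left out.

```python
import time
from collections import defaultdict


class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.ts = burst, time.monotonic()

    def take(self, n=1):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.ts) * self.rate)
        self.ts = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


class Gateway:
    def __init__(self, max_inflight, queue_max, rate, burst):
        self.max_inflight, self.queue_max = max_inflight, queue_max
        self.inflight = 0   # requests currently running upstream
        self.queued = 0     # requests waiting in the bounded buffer
        self.buckets = defaultdict(lambda: TokenBucket(rate, burst))

    def admit(self, api_key, cost=1):
        """Return (accepted, reason) for one incoming request."""
        if not self.buckets[api_key].take(cost):
            return False, "429_token_bucket"   # fairness: per-key throttle
        if self.inflight < self.max_inflight:
            self.inflight += 1
            return True, "run"                 # free slot: start right away
        if self.queued < self.queue_max:
            self.queued += 1
            return True, "queued"              # burst absorption
        return False, "queue_full"             # protection: shed load

    def start_queued(self):
        """A worker picked up a queued item after a slot freed up."""
        self.queued -= 1
        self.inflight += 1

    def finish(self):
        """A running request completed; release its slot."""
        self.inflight -= 1


if __name__ == "__main__":
    g = Gateway(max_inflight=1, queue_max=1, rate=1, burst=2)
    print(g.admit("a"), g.admit("a"), g.admit("a"), g.admit("b"))
```

One design choice to be aware of: in this sketch a request that passes the token bucket but then hits a full queue has still spent its tokens. If that feels unfair to your tenants, check capacity first and only charge the bucket for requests you accept.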

Pair backpressure with circuit breakers on the upstream side and you have a gateway that bends instead of breaks. Try it against EzAI's unified endpoint at ezaiapi.com — you get one API key, predictable rate limits, and the ability to fail over between providers when one of them starts shedding load on you.