Tips Apr 15, 2026 8 min read

AI API Idempotency: Never Pay for Duplicate Requests

EzAI Team

A user clicks "Generate" twice. Your retry logic fires after a timeout. A load balancer replays a request during failover. Each of these sends the same prompt to the AI API — and each one costs you tokens. Idempotency is how you stop paying for the same work twice. This guide walks through building an idempotency layer in Python with Redis that deduplicates AI API calls, returns cached responses for repeated requests, and handles race conditions cleanly.

Why AI APIs Need Idempotency

Traditional REST APIs are cheap per call. A duplicate GET /users costs microseconds. But AI API calls are expensive — a single Claude Opus request with 4K input tokens and 2K output can cost $0.15+. Multiply that by retry storms, double-clicks, or frontend re-renders, and you're burning real money on identical work.

The math is straightforward. If 5% of your requests are unintentional duplicates and you make 10,000 calls per day at an average cost of $0.03 each, that's $15/day — $450/month — wasted on responses you already had.
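
To plug in your own traffic numbers, the same arithmetic can be wrapped in a small helper. This is a minimal sketch: the function name and the 30-day month are assumptions for illustration, not part of any SDK.

```python
def monthly_duplicate_cost(calls_per_day, duplicate_rate, avg_cost_usd, days=30):
    """Estimated spend per month on calls that merely repeat earlier ones."""
    wasted_per_day = calls_per_day * duplicate_rate * avg_cost_usd
    return round(wasted_per_day * days, 2)

# The figures from the paragraph above: 10,000 calls/day, 5% duplicates, $0.03 each
print(monthly_duplicate_cost(10_000, 0.05, 0.03))  # 450.0
```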

Unlike traditional caching, idempotency operates on request identity, not response freshness. Two requests with identical inputs that arrive within the dedup window should get the same response, whether or not a cache policy would consider that response stale.

The Idempotency Key Pattern

The core idea: generate a deterministic key from the request payload, check Redis before calling the API, and store the response after a successful call. Here's the building block:

python
import hashlib, json, redis, anthropic

r = redis.Redis(host="localhost", port=6379, db=0)
client = anthropic.Anthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com"
)

def make_idempotency_key(model, messages, max_tokens, system=None):
    """Deterministic key from request params."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "system": system or "",
    }
    raw = json.dumps(payload, sort_keys=True)
    return f"idem:{hashlib.sha256(raw.encode()).hexdigest()[:16]}"

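The key is a SHA-256 hash of the serialized request body, truncated to 16 hex characters (64 bits). Collision probability stays negligible for typical workloads, and the shorter key saves a little memory per entry. Keep in mind that a collision would return one prompt's response for a different prompt, so keep the full digest if you'd rather not reason about that risk. The key also covers only the four parameters shown; any other parameter that changes the output needs to be added to the payload.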
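
To put a number on "negligible", the standard birthday-bound approximation gives a rough collision estimate for a 64-bit key. This is a sketch under the assumption that truncated SHA-256 output behaves like uniformly random bits; `collision_probability` is a hypothetical helper name.

```python
import math

def collision_probability(num_keys, bits=64):
    """Approximate chance that at least two of num_keys random keys collide."""
    exponent = -num_keys * (num_keys - 1) / (2 * 2**bits)
    # expm1 keeps precision when the probability is tiny
    return -math.expm1(exponent)

for n in (1_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} keys -> p ~ {collision_probability(n):.2e}")
```

With a million live keys the estimate is roughly 2.7e-8. Only keys that exist at the same time can collide, so a short TTL keeps the effective count low.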

Figure: Idempotency request flow. Check Redis → hit returns cached response, miss acquires lock → calls API → stores result.

Handling Race Conditions with Locks

The naive check-then-call pattern has a gap: two identical requests arrive within milliseconds, both see a cache miss, and both call the API. You need a distributed lock to serialize concurrent duplicates:

python
import time

def idempotent_create(model, messages, max_tokens, system=None, ttl=3600):
    key = make_idempotency_key(model, messages, max_tokens, system)
    lock_key = f"{key}:lock"

    # 1. Check cache
    cached = r.get(key)
    if cached:
        return json.loads(cached)  # Cache hit — free!

    # 2. Acquire lock (NX = only if not exists, EX = 120s timeout)
    acquired = r.set(lock_key, "1", nx=True, ex=120)

    if not acquired:
        # Another request is in-flight — wait for result
        for _ in range(60):
            time.sleep(1)
            cached = r.get(key)
            if cached:
                return json.loads(cached)
        raise TimeoutError("Idempotency lock timed out")

    try:
        # 3. Call the AI API
        response = client.messages.create(
            model=model,
            max_tokens=max_tokens,
            system=system or "",
            messages=messages,
        )

        # 4. Cache the response
        result = {
            "content": response.content[0].text,
            "model": response.model,
            "usage": {
                "input": response.usage.input_tokens,
                "output": response.usage.output_tokens,
            },
            "cached_at": time.time(),
        }
        r.setex(key, ttl, json.dumps(result))
        return result

    finally:
        r.delete(lock_key)

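The lock has a 120-second expiry as a safety net: if the process crashes mid-call, the lock auto-releases instead of blocking future requests forever. Waiting requests poll once per second for up to 60 seconds, which is fine for AI calls that typically take 2–30 seconds. Two caveats apply. If the API call fails, the lock is released with nothing cached, so waiters keep polling until they raise TimeoutError. And if a call outlives the 120-second expiry, the finally block can delete a lock that another request has since acquired.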
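
One hardening step worth considering: the `finally: r.delete(lock_key)` release deletes the lock unconditionally, even if it expired and a different request now holds it. A common remedy is to store a unique token in the lock and delete only when the token still matches. The sketch below assumes `r` is a redis-py client (it relies on `set` with `nx`/`ex` and on `eval`); `acquire_lock` and `release_lock` are hypothetical helper names, not part of the code above.

```python
import uuid

# Lua runs atomically inside Redis: delete the lock only if we still own it.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""

def acquire_lock(r, lock_key, ttl_seconds=120):
    """Return a unique owner token if the lock was acquired, else None."""
    token = uuid.uuid4().hex
    if r.set(lock_key, token, nx=True, ex=ttl_seconds):
        return token
    return None

def release_lock(r, lock_key, token):
    """Delete the lock only if it still holds our token; True if released."""
    return r.eval(RELEASE_SCRIPT, 1, lock_key, token) == 1
```

In `idempotent_create`, this would replace the `r.set(lock_key, "1", ...)` call and the `r.delete(lock_key)` in the `finally` block.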

Choosing the Right TTL

TTL (time-to-live) depends on how your application uses AI responses:

  • Code generation / analysis: 1–4 hours. Same code + same prompt = same review. Set ttl=3600.
  • Chat / conversational: 5–15 minutes. Users expect fresh responses if they retry. Set ttl=300.
  • Batch processing: 24 hours+. You're processing a dataset — duplicates should always return cached. Set ttl=86400.
  • Retry-only dedup: 30–60 seconds. Only catch actual retries, not intentional re-asks. Set ttl=60.
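
The suggestions above could be kept in one place as a lookup table, so call sites name a workload instead of a number. This is a minimal sketch; the workload labels and the `ttl_for` helper are made up for illustration.

```python
# TTLs in seconds, taken from the list above
TTL_SECONDS = {
    "code_analysis": 3600,   # 1 hour
    "chat": 300,             # 5 minutes
    "batch": 86400,          # 24 hours
    "retry_only": 60,        # catch retries only
}

def ttl_for(workload, default=3600):
    """Look up the TTL for a workload label, falling back to a default."""
    return TTL_SECONDS.get(workload, default)

print(ttl_for("chat"))  # 300
```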

Integrating with EzAI API

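Here's a wrapper class that plugs into any existing codebase using the Anthropic SDK with EzAI API. To stay short it leaves out the lock from the previous section, so concurrent duplicates can still both reach the API; add the lock back if that matters for your traffic: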

python
import hashlib, json, time, logging
import redis, anthropic

log = logging.getLogger("idempotent_ai")

class IdempotentAI:
    def __init__(self, api_key, redis_url="redis://localhost:6379/0",
                 default_ttl=3600):
        self.client = anthropic.Anthropic(
            api_key=api_key,
            base_url="https://ezaiapi.com",
        )
        self.redis = redis.from_url(redis_url)
        self.default_ttl = default_ttl
        self.stats = {"hits": 0, "misses": 0, "saved_tokens": 0}

    def create(self, model, messages, max_tokens=1024,
               system=None, ttl=None, bypass=False):
        if bypass:
            return self._call_api(model, messages, max_tokens, system)

        ttl = ttl or self.default_ttl
        key = self._make_key(model, messages, max_tokens, system)

        # Check cache
        cached = self.redis.get(key)
        if cached:
            self.stats["hits"] += 1
            data = json.loads(cached)
            self.stats["saved_tokens"] += data["usage"]["input"] + data["usage"]["output"]
            log.info(f"Idempotency HIT: {key[:20]} (saved {data['usage']['input']+data['usage']['output']} tokens)")
            return data

        self.stats["misses"] += 1
        result = self._call_api(model, messages, max_tokens, system)
        self.redis.setex(key, ttl, json.dumps(result))
        return result

    def _call_api(self, model, messages, max_tokens, system):
        resp = self.client.messages.create(
            model=model, messages=messages,
            max_tokens=max_tokens, system=system or "",
        )
        return {
            "content": resp.content[0].text,
            "model": resp.model,
            "usage": {"input": resp.usage.input_tokens,
                     "output": resp.usage.output_tokens},
        }

    def _make_key(self, model, messages, max_tokens, system):
        raw = json.dumps({
            "m": model, "msg": messages,
            "mt": max_tokens, "s": system or "",
        }, sort_keys=True)
        return f"idem:{hashlib.sha256(raw.encode()).hexdigest()[:16]}"

# Usage
ai = IdempotentAI(api_key="sk-your-key")

# First call — hits the API
r1 = ai.create("claude-sonnet-4-5", [{"role": "user", "content": "Explain TCP handshake"}])

# Second call — same params, returns cached (zero cost)
r2 = ai.create("claude-sonnet-4-5", [{"role": "user", "content": "Explain TCP handshake"}])

print(ai.stats)
# {'hits': 1, 'misses': 1, 'saved_tokens': 847}

The bypass=True flag lets you skip idempotency for specific calls — useful for conversational flows where repeated questions should genuinely get fresh answers.

Monitoring Your Savings

Track hit rates to quantify how much idempotency saves you. Add this endpoint to your cost dashboard:

python
# Add to your FastAPI/Flask app
@app.get("/metrics/idempotency")
def idempotency_metrics():
    total = ai.stats["hits"] + ai.stats["misses"]
    hit_rate = (ai.stats["hits"] / total * 100) if total else 0
    # Rough estimate: averages $3/MTok input and $15/MTok output into ~$9/MTok.
    # These rates are illustrative; substitute your model's actual pricing.
    saved_cost = ai.stats["saved_tokens"] * 0.000009
    return {
        "hit_rate_pct": round(hit_rate, 1),
        "total_requests": total,
        "cache_hits": ai.stats["hits"],
        "tokens_saved": ai.stats["saved_tokens"],
        "est_cost_saved_usd": round(saved_cost, 4),
    }
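
The single blended factor of 0.000009 treats input and output tokens as equally priced. Since each cached entry already stores input and output counts separately, a slightly more accurate estimate can price them separately. This is a sketch: the default rates are the illustrative $3 and $15 per million tokens from the comment above, and `estimate_saved_usd` is a hypothetical helper.

```python
def estimate_saved_usd(input_tokens, output_tokens,
                       input_rate_per_mtok=3.0, output_rate_per_mtok=15.0):
    """Price input and output tokens at their own per-million-token rates."""
    total = (input_tokens * input_rate_per_mtok
             + output_tokens * output_rate_per_mtok)
    return total / 1_000_000

# Example split of 847 tokens into 147 input and 700 output (made-up numbers)
blended = 847 * 0.000009
separate = estimate_saved_usd(147, 700)
print(round(blended, 6), round(separate, 6))
```

To use it, the wrapper would need to track saved input and output tokens as two counters instead of the single `saved_tokens` total.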

Edge Cases to Handle

Production idempotency layers need to account for a few gotchas:

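  • Temperature > 0: With temperature=0.7, identical inputs are expected to produce different outputs, so a cached response freezes one sample. Use idempotency with temperature=0, or accept that cached responses won't reflect randomness. The examples in this guide never set temperature, so they run at the API default; if you pass temperature or other sampling parameters, add them to the key so requests with different settings don't share a cache entry. For most production use cases (code gen, analysis, extraction), temperature=0 is what you want anyway.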
  • Streaming responses: You can't cache a stream mid-flight. Accumulate the full response, then cache it. Subsequent requests return the cached full response instantly — which is actually faster than streaming.
  • Redis eviction: Set maxmemory-policy allkeys-lru so Redis evicts least-recently-used keys when memory fills up. Idempotency keys are small (the response text is the bulk), so 1GB of Redis holds tens of thousands of cached responses.
  • Multi-region: If you run multiple regions, use a shared Redis cluster or accept per-region dedup. For most teams, per-region is fine — duplicates across regions are rare.
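
The streaming advice above (accumulate, then cache) can be sketched as a generator that passes chunks through to the caller and writes to the cache only once the stream ends cleanly. Assumptions: `chunks` is any iterable of text pieces (for example the Anthropic SDK's text stream), `cache` is a redis-py client or anything else with `setex`, and `stream_and_cache` is a hypothetical name.

```python
import json

def stream_and_cache(key, chunks, cache, ttl=3600):
    """Yield chunks as they arrive; cache the joined text after the last one."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk
    # Reached only if the stream finished without raising or being abandoned,
    # so a partial response is never cached.
    cache.setex(key, ttl, json.dumps({"content": "".join(parts)}))
```

If the stream raises midway or the caller stops reading early, nothing is stored and the next identical request goes to the API again.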

When Not to Use Idempotency

Skip the idempotency layer when:

  • The AI call is part of an interactive chat and the user explicitly re-asked
  • You're using high temperature for creative generation and need variety
  • The response includes tool calls that trigger side effects (e.g., function calling that creates database records)
  • You're already using prompt caching and the input token savings are sufficient
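
The first three conditions above can be folded into a small predicate that decides whether to pass `bypass=True`. This is only a sketch: `should_use_idempotency` and its flags are hypothetical, `params` stands for the keyword arguments you would send to the API, and the prompt-caching case is left as a judgment call outside the code.

```python
def should_use_idempotency(params, *, user_reasked=False, needs_variety=False):
    """Return False for requests that should skip the idempotency layer."""
    if user_reasked or needs_variety:
        return False
    if params.get("tools"):
        # Tool calls may trigger side effects, so never replay them from cache
        return False
    return True

params = {"model": "claude-sonnet-4-5", "max_tokens": 1024}
print(should_use_idempotency(params))                     # True
print(should_use_idempotency(params, user_reasked=True))  # False
```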

Idempotency and prompt caching solve different problems. Prompt caching reduces input token costs by reusing prefixes on the provider side. Idempotency prevents entire duplicate calls on your side. They stack well together — use both when it makes sense.

Wrapping Up

Adding idempotency to your AI API calls takes about 50 lines of Python and a Redis instance. The payoff is immediate: zero wasted spend on duplicate requests, faster responses for repeated queries, and cleaner retry logic throughout your stack. Start with a 1-hour TTL, monitor your hit rate on the EzAI dashboard, and adjust from there.

If you're running AI calls at scale, this is the lowest-effort, highest-impact optimization you can ship this week.

