Cache AI API Responses to Cut Costs by 60%

Every time your app sends the same prompt to Claude or GPT, you're paying full price for a response you already have. AI API caching is the single most effective way to reduce costs without changing your model or degrading quality. Teams that implement it typically see 40–70% cost reduction on their first month.

This guide covers three caching strategies — from a simple hash-based cache you can build in 20 minutes, to semantic caching that catches near-duplicate prompts your users don't even realize they're sending.

Why Most AI Costs Are Wasted

Analyze your API logs and you'll find a pattern: a huge chunk of requests are effectively duplicates. Customer support bots answering the same FAQ. Code assistants explaining the same error. Translation services processing identical strings.

Here's what a typical production workload looks like:

30-40% of prompts are exact duplicates (same user, same input)
15-25% are near-duplicates (different wording, same intent)
Only 35-55% are genuinely unique requests

That means up to 65% of your AI spend is going toward answers you've already generated. A cache fixes this instantly.

Strategy 1: Hash-Based Exact Cache

The simplest approach — hash the prompt + model + parameters, store the response, return it on match. Zero false positives, zero complexity. This alone catches 30-40% of traffic for most apps.

python

import hashlib, json, redis
import anthropic

cache = redis.Redis(host="localhost", port=6379, db=0)
client = anthropic.Anthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com"
)

def cache_key(model, messages, max_tokens):
    """Generate a deterministic cache key from request params."""
    payload = json.dumps({
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens
    }, sort_keys=True)
    return f"ai:cache:{hashlib.sha256(payload.encode()).hexdigest()[:16]}"

def ask_ai(prompt, model="claude-sonnet-4-5", max_tokens=1024, ttl=3600):
    messages = [{"role": "user", "content": prompt}]
    key = cache_key(model, messages, max_tokens)

    # Check cache first
    cached = cache.get(key)
    if cached:
        return json.loads(cached)  # Free! No API call

    # Cache miss — call the API
    response = client.messages.create(
        model=model, max_tokens=max_tokens, messages=messages
    )
    result = response.content[0].text

    # Store with TTL
    cache.setex(key, ttl, json.dumps(result))
    return result

This is production-ready in 20 lines. The key insight is using sort_keys=True in the JSON serialization so that {"role": "user", "content": "hi"} and {"content": "hi", "role": "user"} produce the same hash.

Strategy 2: Prompt Normalization

Prompt normalization pipeline showing whitespace trimming, lowercasing, and hash generation

Normalization pipeline — clean prompts before hashing to increase cache hit rate

Users phrase the same question differently. "What is Python?" and "what is python ?" are functionally identical but produce different hashes. Normalization fixes this by cleaning prompts before hashing:

python

import re

def normalize_prompt(text):
    """Normalize prompt for better cache hits."""
    text = text.strip().lower()
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace
    text = re.sub(r'[.!?]+$', '', text)  # strip trailing punctuation
    return text

# These all become the same cache key:
# "What is Python?"    → "what is python"
# "what is  python"    → "what is python"
# "What is Python!!"   → "what is python"

def ask_ai_normalized(prompt, model="claude-sonnet-4-5", **kwargs):
    normalized = normalize_prompt(prompt)
    messages = [{"role": "user", "content": normalized}]
    key = cache_key(model, messages, kwargs.get("max_tokens", 1024))

    cached = cache.get(key)
    if cached:
        return json.loads(cached)

    # Send original prompt for best output quality
    response = client.messages.create(
        model=model, max_tokens=kwargs.get("max_tokens", 1024),
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.content[0].text
    cache.setex(key, 3600, json.dumps(result))
    return result

Important detail: normalize the prompt for the cache key only. Send the original prompt to the API so the model gets proper capitalization and punctuation. This typically adds another 10-15% cache hit rate on top of exact matching.

Strategy 3: Semantic Caching with Embeddings

This is the heavy hitter. Instead of matching exact strings, you compare the meaning of prompts using vector embeddings. "How do I sort a list in Python?" and "Python list sorting tutorial" have completely different text but identical intent.

python

import numpy as np
from sentence_transformers import SentenceTransformer
import anthropic, json, redis

# Local embedding model — fast, free, no API calls
embedder = SentenceTransformer("all-MiniLM-L6-v2")
cache = redis.Redis(host="localhost", port=6379, db=1)
client = anthropic.Anthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com"
)

SIMILARITY_THRESHOLD = 0.92  # Tune this — higher = fewer false positives

def get_embedding(text):
    return embedder.encode(text, normalize_embeddings=True)

def cosine_sim(a, b):
    return float(np.dot(a, b))

def semantic_lookup(prompt_embedding):
    """Find the closest cached prompt by meaning."""
    best_score, best_key = 0, None
    for key in cache.scan_iter("sem:emb:*"):
        stored = np.frombuffer(cache.get(key), dtype=np.float32)
        score = cosine_sim(prompt_embedding, stored)
        if score > best_score:
            best_score, best_key = score, key
    if best_score >= SIMILARITY_THRESHOLD:
        resp_key = best_key.decode().replace("sem:emb:", "sem:resp:")
        return json.loads(cache.get(resp_key))
    return None

def ask_ai_semantic(prompt, model="claude-sonnet-4-5"):
    embedding = get_embedding(prompt)
    cached = semantic_lookup(embedding)
    if cached:
        return cached  # Semantic hit!

    response = client.messages.create(
        model=model, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.content[0].text

    # Store embedding + response
    uid = hashlib.md5(prompt.encode()).hexdigest()[:12]
    cache.setex(f"sem:emb:{uid}", 86400, embedding.tobytes())
    cache.setex(f"sem:resp:{uid}", 86400, json.dumps(result))
    return result

For production, swap the linear scan with a vector database like FAISS or Qdrant. The scan approach works fine up to ~10K cached entries, after which you'll want indexed search.

Choosing the Right TTL

Not all cached responses should live forever. Your TTL (time-to-live) strategy depends on how static the information is:

Factual / reference queries (what is X, how to Y) → 24-72 hours. These don't change often.
Code generation → 1-6 hours. Libraries update, best practices shift.
News / time-sensitive → 15-30 minutes. Stale data is worse than no cache.
Creative / varied outputs → Don't cache. Users expect different results each time.

A good default is 1 hour for general-purpose caches. You can always tune it per endpoint once you have traffic data.

Production Architecture

Caching architecture diagram showing request flow through normalization, hash check, semantic check, and API fallback

Full caching pipeline — normalize → exact match → semantic match → API call

In production, layer all three strategies together. The request flows through:

Normalize — clean whitespace, case, punctuation
Exact hash check — fastest, O(1) lookup in Redis
Semantic check — catches near-duplicates the hash missed
API call — only for genuinely new prompts

Each layer is a fallback. Most requests get caught at step 2 (cheap), some at step 3 (slightly more expensive due to embedding), and only unique queries hit the API.

Measuring Your Cache Performance

Track these three metrics to know if your cache is working:

python

from dataclasses import dataclass, field

@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0
    saved_dollars: float = 0.0

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0

    def record_hit(self, est_cost=0.003):
        self.hits += 1
        self.saved_dollars += est_cost

    def report(self):
        return {
            "hit_rate": f"{self.hit_rate:.1%}",
            "total_saved": f"${self.saved_dollars:.2f}",
            "cached_responses": self.hits
        }

metrics = CacheMetrics()
# After 1 week: {"hit_rate": "58.3%", "total_saved": "$127.40", ...}

Monitor your hit rate daily. Below 30%? Your TTL might be too short or your traffic is highly diverse. Above 70%? You might be serving stale responses — consider shorter TTLs for time-sensitive endpoints.

When NOT to Cache

Caching isn't always the right call. Skip it when:

Temperature > 0 and users expect varied responses (creative writing, brainstorming)
User context matters — personalized responses based on conversation history
Real-time data — if the prompt references "today" or "current," cached results go stale fast
Security-sensitive — don't serve User A's cached medical/financial advice to User B

The fix for user-specific caching is simple: include the user ID in your cache key. That way each user has their own cache namespace.

Quick Wins to Start Today

You don't need a full semantic caching pipeline to start saving money. Here's a 30-minute implementation plan:

Add Redis — docker run -d redis:alpine if you don't have one
Wrap your API calls with the hash-based cache from Strategy 1
Add normalization from Strategy 2 for another 10-15% boost
Monitor hit rates with the metrics class above
Tune TTLs per endpoint based on a week of data

This gets you 40-60% cost reduction with minimal effort. Add semantic caching later when you have the traffic data to justify the complexity.

Caching works especially well with other cost reduction strategies like model routing and prompt optimization. Stack them together and you can easily cut your AI API bill by 80% or more. If you're running through EzAI's API, the savings stack on top of already-reduced pricing — meaning your effective per-token cost drops to a fraction of direct API pricing.

Cache AI API Responses to Cut Costs by 60%

Why Most AI Costs Are Wasted

Strategy 1: Hash-Based Exact Cache

Strategy 2: Prompt Normalization

Strategy 3: Semantic Caching with Embeddings

Choosing the Right TTL

Production Architecture

Measuring Your Cache Performance

When NOT to Cache

Quick Wins to Start Today

Related Posts

7 Ways to Reduce AI API Costs Without Losing Quality

AI Model Routing: Pick the Right Model for Every Task