Tutorial · Apr 23, 2026 · 7 min read

AI API Feature Flags: Safe LLM Rollouts in Production

EzAI Team

You ship a prompt change on Friday. By Monday, support tickets are up 40%, your token bill doubled, and nobody can pinpoint when it broke. This is what shipping AI features without flags looks like.

Feature flags are routine for backend services, but most teams skip them for LLM calls — and then ship every prompt edit, model swap, and parameter tweak straight to 100% of users. This guide covers three rollout patterns built specifically for AI APIs, with code that works against the EzAI endpoint today.

Why Feature Flag Your AI Calls

LLM behavior is non-deterministic. A prompt change whose outputs look identical on your test set can blow up on the production traffic distribution. A new model version may be cheaper per token but hallucinate more on your domain. Temperature changes ripple through downstream parsers in ways unit tests don't catch.

Three things go wrong without flags:

  • Quality regressions — output format breaks, tool-calling reliability drops, refusal rates spike
  • Cost surprises — a "smarter" model uses 3x more tokens for the same task
  • Latency cliffs — extended thinking modes add 5-30s p95 you didn't model

Feature flags decouple deploy from release. You ship the new code path on Monday, enable it for 1% of traffic Tuesday, watch dashboards, and roll forward or back without redeploying.

Implementation Pattern

Keep it small. You don't need LaunchDarkly for an AI flag — a config dict and a hash function get you 80% of the value. Here's the core pattern in Python:

```python
import hashlib, os
from anthropic import Anthropic

client = Anthropic(
    api_key=os.environ["EZAI_KEY"],
    base_url="https://ezaiapi.com",
)

FLAGS = {
    "new_summarizer_model": {
        "control": "claude-haiku-4-5",
        "variant": "claude-sonnet-4-5",
        "rollout_pct": 5,  # 0-100
    }
}

def in_variant(flag: str, user_id: str) -> bool:
    cfg = FLAGS[flag]
    h = int(hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest(), 16)
    return (h % 100) < cfg["rollout_pct"]

def summarize(text: str, user_id: str) -> str:
    cfg = FLAGS["new_summarizer_model"]
    model = cfg["variant"] if in_variant("new_summarizer_model", user_id) else cfg["control"]
    msg = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
        metadata={"user_id": user_id, "flag_variant": model},
    )
    return msg.content[0].text
```

Three things matter here. First, the hash uses flag:user_id so the same user gets sticky assignment across requests — flipping people in and out of the variant generates noisy metrics. Second, the model selection happens once per call, close to the API boundary. Third, you tag the request with metadata.flag_variant so your dashboard can split metrics by group.
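
Both properties of the bucketing, stickiness and the variant share, can be checked offline with no API calls. This sketch reproduces `FLAGS` and `in_variant` from the snippet above so it runs standalone:

```python
import hashlib

FLAGS = {
    "new_summarizer_model": {
        "control": "claude-haiku-4-5",
        "variant": "claude-sonnet-4-5",
        "rollout_pct": 5,  # 0-100
    }
}

def in_variant(flag: str, user_id: str) -> bool:
    cfg = FLAGS[flag]
    h = int(hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest(), 16)
    return (h % 100) < cfg["rollout_pct"]

# stickiness: the same user always gets the same assignment
assert in_variant("new_summarizer_model", "user-42") == \
       in_variant("new_summarizer_model", "user-42")

# distribution: over many users, the variant share converges on rollout_pct
share = sum(in_variant("new_summarizer_model", f"user-{i}")
            for i in range(100_000)) / 100_000
print(f"variant share: {share:.3f}")  # converges near 0.05
```

Because MD5 spreads inputs uniformly, bumping `rollout_pct` from 5 to 25 keeps every existing variant user in the variant and only adds new ones, which is exactly what you want during a ramp.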

Rollout Strategies

[Figure: comparison of shadow, canary, and ramped rollout strategies. Pick the pattern that matches your risk profile and how reversible the change is.]

Shadow (0% user impact)

Run the new model in parallel with the old one, return the old response, and log both. You see real production traffic on the new model with zero user-visible risk. Use this when you're evaluating a model swap and want comparable outputs side-by-side.

```python
import asyncio, logging, os
from anthropic import AsyncAnthropic

aclient = AsyncAnthropic(api_key=os.environ["EZAI_KEY"], base_url="https://ezaiapi.com")

async def log_shadow(prompt: str, model: str):
    # shadow request: same prompt on the candidate model; result is only logged
    msg = await aclient.messages.create(model=model, max_tokens=512,
                                        messages=[{"role": "user", "content": prompt}])
    logging.info("shadow[%s]: %s", model, msg.content[0].text)

async def shadow_call(prompt: str):
    control = await aclient.messages.create(model="claude-haiku-4-5", max_tokens=512,
                                            messages=[{"role": "user", "content": prompt}])
    # fire-and-forget shadow request: log result, never serve to user
    asyncio.create_task(log_shadow(prompt, "gpt-4.1"))
    return control
```

Cost note: shadow doubles your API spend for the shadowed traffic. Sample it — 5-10% of requests is usually enough to spot regressions. Shadow testing for AI models goes deeper on offline diff scoring.
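
A cheap way to apply that sampling is a per-request coin flip with the rate as a knob. Random (rather than hashed) sampling is fine here because the shadow result is never user-visible, so there is no consistency requirement:

```python
import random

SHADOW_SAMPLE_RATE = 0.05  # shadow ~5% of requests to cap the doubled spend

def should_shadow(rate: float = SHADOW_SAMPLE_RATE) -> bool:
    # no sticky assignment needed: the user never sees the shadow output,
    # so each request can decide independently
    return random.random() < rate
```

Then gate the fire-and-forget task on `should_shadow()` instead of shadowing every call.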

Canary (1-10%)

Send a small slice of real traffic to the new variant. Monitor error rate, p95 latency, and cost-per-request against the control. Most production AI rollouts should start here.

What to monitor in the first hour:

  • HTTP error rate — should match control within 0.1pp
  • Token usage delta — flag if variant uses >20% more tokens
  • Output validity — for structured output, check JSON parse rate
  • Downstream signals — thumbs-down rate, retry rate, support contacts

Wire automatic rollback. If error rate doubles or cost-per-call jumps, the canary should disable itself without a human in the loop. Pair with our SLO and error budget guide for thresholds.
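
A minimal sketch of that rollback loop, run on a timer against whatever your metrics store exposes. The doubled-error-rate trigger comes from above; the 1.5x cost threshold and the metric names are illustrative assumptions, not a fixed recommendation:

```python
FLAGS = {"new_summarizer_model": {"rollout_pct": 5}}  # canary at 5%

def check_canary(flag: str, metrics: dict) -> bool:
    """Kill the canary if either guardrail is breached.

    `metrics` holds per-arm rates pulled from your dashboard API (names here
    are illustrative). Returns True if the flag was disabled."""
    errors_doubled = metrics["variant_error_rate"] > 2 * metrics["control_error_rate"]
    cost_jumped = metrics["variant_cost_per_call"] > 1.5 * metrics["control_cost_per_call"]
    if errors_doubled or cost_jumped:
        FLAGS[flag]["rollout_pct"] = 0  # hard kill: all traffic back to control
        return True
    return False
```

The important property is that the kill path only writes config, so it cannot itself fail on an API call.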

Ramped (25 → 50 → 75 → 100)

After the canary holds, ramp in stages with a hold period at each step (24h is a reasonable default for B2B traffic, 1-4h for high-volume B2C). The hold catches issues that only appear at scale or under specific traffic patterns — late-night batch jobs, weekend usage, edge-case prompts.
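
The stage-and-hold logic is small enough to sketch directly. This assumes the flag config carries a `stage_started_at` timestamp alongside `rollout_pct`; a scheduler would call `advance_ramp` periodically with the latest health verdict:

```python
import time

RAMP_STAGES = [5, 25, 50, 75, 100]   # percent of traffic on the variant
HOLD_SECONDS = 24 * 3600             # 24h hold per stage (the B2B default above)

FLAGS = {"new_summarizer_model": {"rollout_pct": 5, "stage_started_at": 0.0}}

def advance_ramp(flag: str, healthy: bool) -> int:
    """Advance one stage once the hold has elapsed; kill on any regression."""
    cfg = FLAGS[flag]
    if not healthy:
        cfg["rollout_pct"] = 0  # any regression drops straight back to control
    elif (cfg["rollout_pct"] in RAMP_STAGES
          and time.time() - cfg["stage_started_at"] >= HOLD_SECONDS):
        nxt = RAMP_STAGES.index(cfg["rollout_pct"]) + 1
        if nxt < len(RAMP_STAGES):
            cfg["rollout_pct"] = RAMP_STAGES[nxt]
            cfg["stage_started_at"] = time.time()
    return cfg["rollout_pct"]
```

Note the asymmetry: ramping up requires a full healthy hold, while ramping down happens immediately.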

```bash
# quick A/B sanity check via curl against EzAI
curl "https://ezaiapi.com/v1/messages" \
  -H "x-api-key: $EZAI_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 200,
    "metadata": {"flag_variant": "sonnet-4-5", "rollout_stage": "canary-5pct"},
    "messages": [{"role": "user", "content": "ping"}]
  }'
```

Pitfalls to Avoid

Don't flag on request ID. Each request gets a fresh assignment, your metrics turn into noise, and a single user sees inconsistent behavior. Always hash on a stable identity (user, org, session).

Don't compare cost in dollars during the rollout. Compare tokens-per-request and latency separately. If you see "variant costs 2x" in your dashboard, that's signal — but it might be a pricing difference, not a behavior change.
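
The token side of that comparison falls out of the response object: Anthropic-style responses carry `usage.input_tokens` and `usage.output_tokens` on every message. A sketch of per-arm accounting (helper names are ours, not from any SDK):

```python
def record_usage(msg, variant: str, stats: dict) -> None:
    # accumulate tokens per arm; dollar cost would mix pricing into the signal
    tokens = msg.usage.input_tokens + msg.usage.output_tokens
    bucket = stats.setdefault(variant, {"calls": 0, "tokens": 0})
    bucket["calls"] += 1
    bucket["tokens"] += tokens

def tokens_per_request(stats: dict, variant: str) -> float:
    b = stats[variant]
    return b["tokens"] / b["calls"] if b["calls"] else 0.0
```

Comparing `tokens_per_request` between arms tells you whether the variant's behavior changed; the price sheet tells you what that costs.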

Clean up old flags. A flag that's been at 100% for 30 days is dead code with extra branches. Delete it. Stale flags compound and the next person reading the code will assume the gate still matters.

Test the rollback path. Set the flag to 0% in staging once a week. If the rollback toggle hasn't fired in production for 6 months, it's probably broken.

Wrap-up

Feature flags for AI calls are not over-engineering — they're the cheapest insurance against a class of incidents that will absolutely happen as you ship more LLM features. Start with shadow when evaluating a model, graduate to canary for the rollout, ramp to 100% with holds, then tear the flag down.

EzAI gives you one endpoint to swap between Claude, GPT, Gemini, and Grok behind the same flag, so you can run multi-provider experiments without juggling SDKs. Grab a key if you don't have one — new accounts get 15 free credits to run your first canary. The Martin Fowler write-up on feature toggles is still the best general primer if you want deeper background on flag taxonomy and lifecycle.
