Tutorial · Mar 25, 2026 · 8 min read

Build an AI API Gateway with Rate Limiting and Auth

EzAI Team

Your team has five services calling Claude and GPT. Each one manages its own API key, retry logic, and error handling. Costs are scattered across dashboards. One rogue loop burns through $200 in tokens before anyone notices. Sound familiar? An AI API gateway solves all of this — a single entry point that handles authentication, rate limiting, caching, and routing before any request hits an AI provider.

In this tutorial, you'll build a lightweight gateway in Node.js that sits between your apps and EzAI's unified API. By the end, every AI request from your stack will flow through a single service with proper controls.

Why Build a Gateway?

Calling AI APIs directly from application code works fine for prototypes. In production, you hit problems fast:

  • Key sprawl — API keys hardcoded or scattered across .env files in 10 repos
  • No spend controls — a misconfigured batch job can drain your balance in minutes
  • Duplicate caching logic — every service reinvents response caching independently
  • Blind spots — no unified view of which service is spending what

A gateway centralizes all of this. Your apps send requests to your gateway using internal tokens. The gateway validates, rate-limits, caches, and forwards to EzAI. One key, one dashboard, one place to set guardrails.

Gateway request flow — auth → rate limit → cache → forward to AI provider

Project Setup

Initialize the project and install dependencies. We're using Express for routing, ioredis for caching and rate limit state, and node-fetch for upstream calls. Request hashing uses Node's built-in crypto module, so no extra package is needed for that.

bash
mkdir ai-gateway && cd ai-gateway
npm init -y
npm i express ioredis node-fetch@3 dotenv

Create a .env file with your EzAI key and Redis connection:

env
EZAI_API_KEY=sk-your-ezai-key
EZAI_BASE_URL=https://ezaiapi.com
REDIS_URL=redis://localhost:6379
PORT=4000

Authentication Middleware

Internal services authenticate against your gateway with their own tokens. This decouples your team's access from the upstream EzAI key — you can revoke a single service's access without rotating the master key.

javascript
// middleware/auth.js
const VALID_KEYS = new Map([
  ['gw-svc-frontend-abc123', { name: 'frontend', rateLimit: 100 }],
  ['gw-svc-backend-def456',  { name: 'backend',  rateLimit: 500 }],
  ['gw-svc-worker-ghi789',  { name: 'worker',   rateLimit: 50  }],
]);

export function authMiddleware(req, res, next) {
  const key = req.headers['x-gateway-key'];
  if (!key) return res.status(401).json({ error: 'Missing x-gateway-key header' });

  const client = VALID_KEYS.get(key);
  if (!client) return res.status(403).json({ error: 'Invalid API key' });

  req.client = client; // attach client metadata for downstream use
  next();
}

In production, you'd store keys in Redis or a database with hashed values. The hardcoded map works for this tutorial.
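To make that production note concrete, here is a minimal sketch of a hashed-key lookup. The module name, Redis key layout (`client:<sha256>`), and field names are assumptions for illustration, not part of the tutorial's code:

```javascript
// middleware/lookupKey.js — illustrative sketch of hashed key storage
import { createHash } from 'node:crypto';

// Hash the raw gateway key so only digests are stored at rest
export function hashKey(rawKey) {
  return createHash('sha256').update(rawKey).digest('hex');
}

// Look up client metadata by hashed key (assumes records were written
// with HSET at `client:<sha256>` during provisioning)
export async function lookupClient(redis, rawKey) {
  const record = await redis.hgetall(`client:${hashKey(rawKey)}`);
  if (!record || !record.name) return null;
  return { name: record.name, rateLimit: Number(record.rateLimit) };
}
```

With this in place, a leaked database dump reveals only digests, and revoking a key is a single `DEL client:<hash>`.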

Token Bucket Rate Limiter

AI API calls are expensive. A token bucket rate limiter lets clients burst briefly but enforces a sustained ceiling. Each client gets a bucket that refills at a fixed rate — spend all your tokens on a burst, then wait for the bucket to refill.

Token Bucket handles bursty LLM traffic better than fixed or sliding window approaches

javascript
// middleware/rateLimit.js
import Redis from 'ioredis';
const redis = new Redis(process.env.REDIS_URL);

export async function rateLimitMiddleware(req, res, next) {
  const { name, rateLimit: maxTokens } = req.client;
  const key = `rl:${name}`;
  const now = Date.now();
  const refillRate = maxTokens / 3600000; // tokens per ms (1 hour window)

  let bucket = await redis.hgetall(key);

  if (!bucket.tokens) {
    // First request — start a full bucket and consume one token for this call
    await redis.hmset(key, { tokens: maxTokens - 1, lastRefill: now });
    await redis.expire(key, 7200);
    return next();
  }

  // Refill tokens based on elapsed time
  const elapsed = now - Number(bucket.lastRefill);
  const refilled = Math.min(
    Number(bucket.tokens) + elapsed * refillRate,
    maxTokens
  );

  if (refilled < 1) {
    const retryMs = Math.ceil((1 - refilled) / refillRate);
    res.set('Retry-After', Math.ceil(retryMs / 1000));
    return res.status(429).json({
      error: 'Rate limit exceeded',
      retryAfterMs: retryMs
    });
  }

  await redis.hmset(key, { tokens: refilled - 1, lastRefill: now });
  next();
}

The Retry-After header tells clients exactly when to retry, so they don't hammer the gateway with blind retries. The bucket state lives in Redis, so it persists across gateway restarts and works with multiple gateway instances behind a load balancer.
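The refill arithmetic is easy to sanity-check in isolation. This standalone helper (hypothetical, not part of the middleware) mirrors the math above:

```javascript
// Pure token-bucket refill math, mirroring the middleware's formula
export function refill(tokens, lastRefill, now, maxTokens, windowMs = 3600000) {
  const refillRate = maxTokens / windowMs; // tokens per ms
  return Math.min(tokens + (now - lastRefill) * refillRate, maxTokens);
}

// A client with a 100-req/hour limit that drained its bucket
// regains exactly 50 tokens after 30 minutes of silence
console.log(refill(0, 0, 1800000, 100)); // 50
```

Extracting the math like this also makes the middleware unit-testable without Redis.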

Response Caching

Identical prompts happen more often than you'd think — especially in production apps with templated queries. Caching saves both latency and money. We hash the request body and store the response in Redis with a TTL.

javascript
// middleware/cache.js
import Redis from 'ioredis';
import { createHash } from 'crypto';
const redis = new Redis(process.env.REDIS_URL);
const CACHE_TTL = 3600; // 1 hour

function hashBody(body) {
  return createHash('sha256').update(JSON.stringify(body)).digest('hex');
}

export async function cacheMiddleware(req, res, next) {
  if (req.body?.stream) return next(); // skip streaming requests

  const cacheKey = `cache:${hashBody(req.body)}`;
  const cached = await redis.get(cacheKey);

  if (cached) {
    res.set('X-Cache', 'HIT');
    return res.json(JSON.parse(cached));
  }

  // Wrap res.json to cache the payload before sending — only 200 responses,
  // so upstream errors never poison the cache
  const originalJson = res.json.bind(res);
  res.json = async (data) => {
    if (res.statusCode === 200) {
      await redis.setex(cacheKey, CACHE_TTL, JSON.stringify(data));
    }
    res.set('X-Cache', 'MISS');
    return originalJson(data);
  };
  next();
}

Streaming requests bypass the cache — you can't meaningfully cache SSE chunks. For non-streaming calls, the X-Cache header lets clients see whether they got a cached response.
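One caveat with hashing `JSON.stringify` output: key order matters, so two semantically identical requests whose fields arrive in different order will miss the cache. A small canonicalizer fixes that — this is a sketch you could swap into `hashBody`, not part of the tutorial's code:

```javascript
import { createHash } from 'node:crypto';

// Recursively sort object keys so equivalent bodies serialize identically.
// Arrays keep their order — message order is semantically meaningful.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.keys(value).sort().map((k) => [k, canonicalize(value[k])])
    );
  }
  return value;
}

export function stableHash(body) {
  return createHash('sha256')
    .update(JSON.stringify(canonicalize(body)))
    .digest('hex');
}
```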

Proxy Route to EzAI

The final piece: forward validated, rate-limited requests to EzAI's API and return the response. This is where your gateway swaps internal tokens for the real EzAI API key.

javascript
// server.js
import express from 'express';
import fetch from 'node-fetch';
import 'dotenv/config';
import { authMiddleware } from './middleware/auth.js';
import { rateLimitMiddleware } from './middleware/rateLimit.js';
import { cacheMiddleware } from './middleware/cache.js';

const app = express();
app.use(express.json({ limit: '1mb' }));

app.post('/v1/messages',
  authMiddleware,
  rateLimitMiddleware,
  cacheMiddleware,
  async (req, res) => {
    const start = Date.now();
    try {
      const upstream = await fetch(
        `${process.env.EZAI_BASE_URL}/v1/messages`,
        {
          method: 'POST',
          headers: {
            'x-api-key': process.env.EZAI_API_KEY,
            'anthropic-version': '2023-06-01',
            'content-type': 'application/json',
          },
          body: JSON.stringify(req.body),
        }
      );

      const data = await upstream.json();
      const latency = Date.now() - start;

      console.log(
        `[${req.client.name}] ${req.body.model} → ${upstream.status} (${latency}ms)`
      );

      res.status(upstream.status).json(data);
    } catch (err) {
      console.error(`Gateway error: ${err.message}`);
      res.status(502).json({ error: 'Upstream request failed' });
    }
  }
);

app.listen(process.env.PORT, () =>
  console.log(`AI Gateway running on :${process.env.PORT}`)
);

Every request gets logged with the client name, model, status, and latency. Pipe these logs to any aggregation tool for real-time visibility into your AI spend.

Testing the Gateway

Start the gateway and make a test call. Your internal services use the x-gateway-key header instead of the real API key:

bash
curl http://localhost:4000/v1/messages \
  -H "x-gateway-key: gw-svc-frontend-abc123" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Ping!"}]
  }'

Run the same request twice — the second call returns instantly with X-Cache: HIT. Try an invalid key and you'll get a clean 403. Exceed the rate limit and the gateway returns 429 with a Retry-After header your clients can respect.
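On the client side, the Retry-After header can drive backoff directly. A minimal sketch, with hypothetical helper names and assuming Node 18+ (or node-fetch) for `fetch`:

```javascript
// Parse a Retry-After header (seconds) into a wait in milliseconds,
// falling back to a default when the header is absent or malformed
export function retryAfterMs(headerValue, fallbackMs = 1000) {
  if (headerValue == null || headerValue === '') return fallbackMs;
  const seconds = Number(headerValue);
  return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : fallbackMs;
}

// Hypothetical client wrapper: retry once when the gateway returns 429
export async function callGateway(url, body, gatewayKey) {
  for (let attempt = 0; attempt < 2; attempt++) {
    const res = await fetch(url, {
      method: 'POST',
      headers: {
        'x-gateway-key': gatewayKey,
        'content-type': 'application/json',
      },
      body: JSON.stringify(body),
    });
    if (res.status !== 429) return res.json();
    await new Promise((r) => setTimeout(r, retryAfterMs(res.headers.get('Retry-After'))));
  }
  throw new Error('Rate limited after retry');
}
```

Because the gateway computes an exact wait from the bucket's refill rate, the client sleeps precisely as long as needed instead of guessing with exponential backoff.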

Going Further

This foundation handles the critical 80%. Here's what to add for a fully production-ready gateway:

  • Per-model cost tracking — parse usage.input_tokens and usage.output_tokens from responses, multiply by EzAI's pricing, and store per-client cost totals in Redis
  • Spend alerts — set a daily budget per client, reject requests when exceeded, and fire a webhook to Slack
  • Streaming support — pipe ReadableStream from the upstream response directly to the client for SSE streaming
  • Model allowlists — restrict which clients can use expensive models like Claude Opus vs cheaper ones like Haiku
  • Request logging to a database — store every request for audit and debugging
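As a starting point for the first bullet, per-request cost can be derived from the response's usage block. The pricing table below is purely illustrative — substitute EzAI's actual per-model rates — and the Redis key layout is a hypothetical choice:

```javascript
// Illustrative per-million-token rates in USD — NOT real EzAI pricing
const PRICING = {
  'claude-sonnet-4-5': { input: 3.0, output: 15.0 },
};

// Compute the dollar cost of a single response's usage block
export function costOf(model, usage) {
  const rates = PRICING[model];
  if (!rates) return 0;
  return (
    (usage.input_tokens / 1e6) * rates.input +
    (usage.output_tokens / 1e6) * rates.output
  );
}

// Accumulate per-client spend in Redis (hypothetical 'spend:daily' hash)
export async function trackCost(redis, clientName, model, usage) {
  await redis.hincrbyfloat('spend:daily', clientName, costOf(model, usage));
}
```

Called after each successful proxy response, `trackCost` gives you a live per-client ledger that a spend-alert check can read on every request.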

The composable middleware pattern makes each feature a single file. Drop in a new middleware, add it to the chain, and deploy. No refactoring needed.

If you don't want to build the infrastructure from scratch, EzAI already provides many of these features out of the box — unified API keys, real-time cost dashboards, multi-model routing, and usage analytics. You can layer your own gateway on top for internal auth and custom rate limits, or use EzAI directly and skip the plumbing entirely.

