How to Test AI API Integrations: Mocks, Snapshots & CI

You built an AI feature, it works in dev, and you ship it. Two days later a model update changes the response format, your parser breaks, and users see a blank screen. No test caught it because you never tested AI integrations — you just eyeballed the output and moved on.

Testing AI APIs is different from testing a REST endpoint that returns deterministic JSON. Model responses drift. Latency varies. Rate limits hit at 2 AM. But that doesn't mean you skip tests. It means you build the right kind of tests. This guide covers a practical three-layer approach to testing AI API integrations that catches real bugs without burning your API budget.

The AI Testing Pyramid

Traditional testing pyramids apply to AI integrations, but with a twist. You need to account for non-deterministic outputs and expensive API calls. Here's how the layers break down:

[Figure: AI API Testing Pyramid with unit, integration, and E2E layers. Caption: Three layers of testing confidence for AI API integrations.]

Unit tests cover prompt construction, response parsing, and token counting — the deterministic pieces. Integration tests use mock servers to simulate API responses, error codes, and edge cases. E2E tests hit the real API sparingly to verify the contract hasn't changed.

Layer 1: Unit Testing Prompts and Parsers

The parts of your AI integration that don't touch the network are fully testable. Prompt builders, response parsers, token estimators — these are plain functions with predictable outputs.

python

# test_prompts.py
import pytest
from myapp.ai import build_prompt, parse_response

def test_build_prompt_includes_system_context():
    prompt = build_prompt(
        task="summarize",
        content="Long article text here...",
        max_words=100
    )
    assert prompt[0]["role"] == "user"
    assert "summarize" in prompt[0]["content"].lower()
    assert "100 words" in prompt[0]["content"]

def test_parse_response_extracts_json():
    raw = """Here's the analysis:\n```json\n{"score": 8, "tags": ["python"]}\n```"""
    result = parse_response(raw)
    assert result["score"] == 8
    assert "python" in result["tags"]

def test_parse_response_handles_no_json():
    raw = "I couldn't analyze this content."
    result = parse_response(raw)
    assert result is None

These tests run in milliseconds, cost nothing, and catch the most common bugs: malformed prompts, broken parsers, and edge cases in response extraction. Run them on every commit.
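
The tests above import build_prompt and parse_response without showing them. Here is one way those two functions could look, as a minimal sketch: the module layout, the prompt wording, and the fence-matching regex are all assumptions chosen to satisfy the three tests, not the article's actual implementation.

```python
# myapp/ai.py (hypothetical sketch; the article does not show this module)
import json
import re

# Build the three-backtick marker programmatically so this file can itself
# be pasted into a fenced block without confusing a renderer.
FENCE = "`" * 3
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def build_prompt(task, content, max_words):
    """Return a one-message list in the Anthropic Messages format."""
    instruction = f"Please {task} the following text in at most {max_words} words."
    return [{"role": "user", "content": f"{instruction}\n\n{content}"}]


def parse_response(raw):
    """Extract a JSON object from a model reply, or return None."""
    match = _FENCE_RE.search(raw)
    # Fall back to the whole reply when the model skipped the code fence.
    candidate = match.group(1) if match else raw
    try:
        parsed = json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Returning None instead of raising keeps the "model refused or rambled" case on the normal code path, which is what test_parse_response_handles_no_json expects.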

Layer 2: Integration Tests with Mock Responses

For the network layer, mock the API instead of calling it. Record real responses once, then replay them in tests. This covers retry logic, error handling, and response processing without spending a cent on API calls.
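
The record-once, replay-later idea can be sketched with nothing but the standard library. The helper name load_or_record and the fetch callback are hypothetical; a dedicated VCR-style library would also work.

```python
import json
from pathlib import Path


def load_or_record(cassette_path, fetch):
    """Replay a saved response if the cassette exists, else record one.

    `fetch` is a hypothetical zero-argument callable that performs the
    real request and returns a JSON-serializable dict.
    """
    path = Path(cassette_path)
    if path.exists():
        # Replay: no network call, no tokens spent.
        return json.loads(path.read_text(encoding="utf-8"))
    response = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response, indent=2), encoding="utf-8")
    return response
```

Pointing it at a file such as tests/cassettes/analyze_code.json means the first run pays for one real call and every later run reads from disk. Delete the file to re-record after a deliberate API change.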

With EzAI API, the response format matches the official Anthropic API exactly, so mocks you build work against both:

python

# test_integration.py
import pytest
from pytest_httpx import HTTPXMock
# AIOverloadedError is assumed to live next to AIClient; adjust to your module.
from myapp.ai_client import AIClient, AIOverloadedError

MOCK_RESPONSE = {
    "id": "msg_mock123",
    "type": "message",
    "role": "assistant",
    "content": [{"type": "text", "text": '{"score": 7, "summary": "Good code"}'}],
    "model": "claude-sonnet-4-5",
    "usage": {"input_tokens": 150, "output_tokens": 30}
}

def test_successful_request(httpx_mock: HTTPXMock):
    httpx_mock.add_response(
        url="https://ezaiapi.com/v1/messages",
        json=MOCK_RESPONSE
    )
    client = AIClient(base_url="https://ezaiapi.com", api_key="sk-test")
    result = client.analyze("Review this code")
    assert result.score == 7
    assert result.tokens_used == 180

def test_rate_limit_retry(httpx_mock: HTTPXMock):
    # First call: 429, second call: success
    httpx_mock.add_response(status_code=429, headers={"retry-after": "1"})
    httpx_mock.add_response(json=MOCK_RESPONSE)

    client = AIClient(base_url="https://ezaiapi.com", api_key="sk-test")
    result = client.analyze("Review this code")
    assert result.score == 7  # succeeded after retry

def test_overloaded_raises_after_retries(httpx_mock: HTTPXMock):
    for _ in range(4):
        httpx_mock.add_response(status_code=529)

    client = AIClient(base_url="https://ezaiapi.com", api_key="sk-test")
    with pytest.raises(AIOverloadedError):
        client.analyze("Review this code")

This pattern catches regressions in your retry logic, timeout handling, and error mapping — the stuff that breaks at 3 AM, not during demos.
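
For reference, the retry behavior those tests exercise might look roughly like the sketch below. It is written against a hypothetical send callable rather than httpx so it runs standalone. The function name, the retryable status set, and the four-attempt default are assumptions inferred from the tests, and it treats retry-after as a number of seconds (the header can also carry an HTTP date, which this sketch does not handle).

```python
import time


class AIOverloadedError(Exception):
    """Raised when the API keeps returning a retryable error."""


RETRYABLE = {429, 529}  # rate limited, overloaded


def call_with_retries(send, max_retries=3, sleep=time.sleep):
    """Call `send()` until it succeeds or the retries run out.

    `send` is a hypothetical zero-argument callable returning a
    (status_code, headers, body) tuple. One initial attempt plus
    `max_retries` retries gives four attempts by default.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status < 400:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable status {status}")
        if attempt < max_retries:
            # Prefer the server's hint; otherwise back off exponentially.
            delay = float(headers.get("retry-after", 2 ** attempt))
            sleep(delay)
    raise AIOverloadedError(f"gave up after {max_retries} retries (last status {status})")
```

Injecting sleep as a parameter lets tests pass a no-op and finish instantly instead of waiting out real backoff delays.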

Layer 3: Snapshot Testing for AI Outputs

AI responses aren't deterministic, but their structure should be. Snapshot testing captures the shape of a response and alerts you when it changes. It's cheaper than asserting exact content and more useful than no assertion at all.

python

# test_snapshots.py

def extract_schema(obj, path=""):
    """Extract the type-structure from any nested object."""
    if isinstance(obj, dict):
        return {k: extract_schema(v, f"{path}.{k}") for k, v in obj.items()}
    elif isinstance(obj, list):
        return [extract_schema(obj[0], f"{path}[0]")] if obj else []
    return type(obj).__name__

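def test_response_schema_stable(snapshot):
    # `snapshot` comes from a pytest snapshot plugin; the `== snapshot`
    # comparison follows syrupy's style (an assumption, adjust to your plugin).
    # In a real test, call the API once, save the snapshot, then compare on future runs.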
    response = {
        "score": 8,
        "tags": ["python", "async"],
        "summary": "Clean implementation with good error handling",
        "suggestions": [{"line": 42, "fix": "Add type hints"}]
    }
    schema = extract_schema(response)
    # schema = {"score": "int", "tags": ["str"], "summary": "str", ...}
    assert schema == snapshot

When a model update changes "score" from an integer to a string, or drops the "suggestions" key entirely, snapshot tests catch it before your users do.
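
A failed snapshot is easier to act on when it names the field that moved. The sketch below is one possible way to diff two schemas. diff_schemas and its path notation are hypothetical helpers, and extract_schema is repeated (without the unused path argument) so the block runs on its own.

```python
def extract_schema(obj):
    """Keep the types, drop the values."""
    if isinstance(obj, dict):
        return {k: extract_schema(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [extract_schema(obj[0])] if obj else []
    return type(obj).__name__


def diff_schemas(old, new, path="$"):
    """Return human-readable differences between two schemas."""
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}"
            if key not in new:
                diffs.append(f"{child}: removed")
            elif key not in old:
                diffs.append(f"{child}: added")
            else:
                diffs.extend(diff_schemas(old[key], new[key], child))
        return diffs
    if isinstance(old, list) and isinstance(new, list) and old and new:
        # Lists are represented by the schema of their first element.
        return diff_schemas(old[0], new[0], f"{path}[]")
    return [] if old == new else [f"{path}: {old} -> {new}"]


before = extract_schema({"score": 8, "tags": ["python"], "suggestions": [{"line": 42}]})
after = extract_schema({"score": "8", "tags": ["python"]})
for line in diff_schemas(before, after):
    print(line)
```

For this pair the diff reports that $.score changed from int to str and that $.suggestions was removed, which are the two failure modes described above.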

Running AI Tests in CI

The goal: unit tests and mock integration tests run on every push. Real API tests run on a schedule — daily or before releases — to catch contract changes without burning credits on every commit.

yaml

# .github/workflows/ai-tests.yml
name: AI Integration Tests

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM UTC

jobs:
  unit-and-mock:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements-test.txt
      - run: pytest tests/ -m "not live_api" --tb=short

  live-api:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    env:
      EZAI_API_KEY: ${{ secrets.EZAI_API_KEY }}
      EZAI_BASE_URL: https://ezaiapi.com
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements-test.txt
      - run: pytest tests/ -m live_api --tb=long -v

Mark live tests with @pytest.mark.live_api. The push job deselects them with -m "not live_api", and the scheduled job selects them with -m live_api. This keeps CI fast on every push while still catching upstream API changes daily.
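
pytest warns about markers it does not know (and fails outright under --strict-markers), so the live_api marker should be registered somewhere. One option, sketched here, is a conftest.py hook. The description string is just an example.

```python
# conftest.py (sketch)
def pytest_configure(config):
    """Register the custom marker so pytest recognizes `live_api`."""
    config.addinivalue_line(
        "markers",
        "live_api: test calls the real API; selected with -m live_api",
    )


# In a test module the marker is then applied like this:
#
#     import pytest
#
#     @pytest.mark.live_api
#     def test_real_api_contract():
#         ...
```

The same registration can live in pytest.ini or pyproject.toml under the markers option if you prefer configuration over code.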

Testing Streaming Responses

Streaming adds complexity — your test needs to handle chunked responses. Here's how to mock a streaming endpoint using EzAI's Anthropic-compatible SSE format:

python

# test_streaming.py
import pytest
from unittest.mock import AsyncMock, patch
from myapp.ai_client import AIClient

STREAM_CHUNKS = [
    b'event: content_block_delta\ndata: {"type":"content_block_delta","delta":{"type":"text_delta","text":"Hello"}}\n\n',
    b'event: content_block_delta\ndata: {"type":"content_block_delta","delta":{"type":"text_delta","text":" world"}}\n\n',
    b'event: message_stop\ndata: {"type":"message_stop"}\n\n',
]

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_stream_collects_text():
    # Assumes AIClient.stream() iterates the object returned by
    # httpx.AsyncClient.stream directly; adapt the mock if your client
    # uses `async with` and response.aiter_bytes() instead.
    mock_stream = AsyncMock()
    # The return value of __aiter__ is the iterable that `async for` walks.
    mock_stream.__aiter__.return_value = STREAM_CHUNKS

    with patch("httpx.AsyncClient.stream", return_value=mock_stream):
        client = AIClient(base_url="https://ezaiapi.com", api_key="sk-test")
        chunks = []
        async for chunk in client.stream("Say hello"):
            chunks.append(chunk)

        assert "".join(chunks) == "Hello world"
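
The piece most worth unit testing here is the SSE parsing itself, since network chunks do not always line up with event boundaries. Below is a minimal sketch of a parser for the event shapes shown above. iter_text_deltas is a hypothetical helper, and it assumes events are separated by a blank line with \n line endings.

```python
import json


def iter_text_deltas(chunks):
    """Yield text fragments from an iterable of raw SSE byte chunks.

    Input is buffered, so an event split across two chunks still parses.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A blank line terminates each SSE event.
        while b"\n\n" in buffer:
            raw_event, buffer = buffer.split(b"\n\n", 1)
            for line in raw_event.decode("utf-8").splitlines():
                if not line.startswith("data:"):
                    continue
                payload = json.loads(line[len("data:"):].strip())
                if payload.get("type") != "content_block_delta":
                    continue
                delta = payload.get("delta", {})
                if delta.get("type") == "text_delta":
                    yield delta["text"]
```

Because it takes any iterable of bytes, this function can be tested synchronously with a plain list, including a list where one event is deliberately cut in half.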

Cost-Aware Testing Practices

Every live API test costs tokens. A few practices keep your testing budget under control:

Use the cheapest model for tests. If you're testing your integration logic (not the model's intelligence), use claude-haiku-3-5 via EzAI instead of Opus. Haiku costs 96% less per token.
Cap max_tokens in tests. Set it to 100-200 for integration tests. You don't need a full response to verify the contract.
Cache test responses. Use prompt caching or local VCR-style cassettes to avoid repeated identical calls.
Run live tests on schedule, not on push. Daily is enough to catch upstream changes. If you need faster detection, use EzAI's webhook notifications for model updates.

With EzAI pricing, a full live test suite running once daily with Haiku typically costs under $0.50/month — less than a cup of coffee.
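
To sanity-check a budget like that for your own suite, the arithmetic is simple enough to script. The function below is a rough sketch, and the prices in the example call are placeholders for illustration only, not real EzAI or Anthropic pricing. Substitute the current per-million-token rates for the model you use.

```python
def monthly_test_cost(tests_per_run, input_tokens, output_tokens,
                      price_in_per_mtok, price_out_per_mtok, runs_per_month=30):
    """Estimate monthly spend in dollars for a scheduled live test suite.

    Token counts are per test; prices are dollars per million tokens.
    """
    per_test = (input_tokens * price_in_per_mtok
                + output_tokens * price_out_per_mtok) / 1_000_000
    return tests_per_run * runs_per_month * per_test


# Placeholder numbers for illustration only, not real pricing.
estimate = monthly_test_cost(
    tests_per_run=10, input_tokens=200, output_tokens=100,
    price_in_per_mtok=1.00, price_out_per_mtok=5.00,
)
print(f"${estimate:.2f} per month")
```

Output tokens usually dominate the bill, which is why capping max_tokens in tests has more effect than trimming prompts.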

Putting It Together

Here's a minimal project structure that implements all three layers:

text

myapp/
├── ai_client.py          # AIClient with retry, parse, stream
└── prompts.py            # Prompt templates and builders
tests/
├── test_prompts.py       # Unit: prompt construction
├── test_parsers.py       # Unit: response parsing
├── test_integration.py   # Mock: retry, errors, timeouts
├── test_streaming.py     # Mock: SSE chunk handling
├── test_snapshots.py     # Schema: response shape
├── test_live.py          # Live: real API (scheduled)
├── conftest.py           # Fixtures: mock data, clients
└── cassettes/            # Recorded API responses
    └── analyze_code.json

Start with unit tests for your prompt builders and parsers. Add mock integration tests for error paths. Then wire up a single live test that runs daily to guard against upstream changes. You don't need 100% coverage on day one — even two or three targeted tests will save you from the "model update broke everything" surprise.

The whole point: build confidence that your AI features work without spending your entire API budget on test runs. Use EzAI's dashboard to track exactly what your test suite costs, and adjust from there.
