How to Stream AI Responses in Real-Time

Waiting 10–30 seconds for an AI model to finish generating before displaying anything is a terrible user experience. Streaming fixes this by sending tokens to the client the moment they're generated, so your users see the response appear word by word, just like ChatGPT. This guide shows you exactly how to implement streaming with the EzAI API using Python, Node.js, and raw HTTP.

Why Streaming Matters

Without streaming, a typical Claude Sonnet request with a 500-token response takes 3–8 seconds, and your user stares at a spinner the entire time. With streaming enabled, the first token typically arrives in under 200ms and the rest follow in real-time. The total generation time is identical, but the wait before the user sees anything shrinks from seconds to a fraction of a second, cutting perceived latency by more than 90%.

Streaming also lets you build features that aren't possible otherwise:

Progressive rendering — show markdown/code as it's generated
Early cancellation — user can stop generation mid-response
Live token counting — display token usage as it accumulates
Typing indicators — show "AI is typing..." with real content

How SSE Streaming Works

AI APIs use Server-Sent Events (SSE) for streaming. When you set "stream": true in your request, instead of one big JSON response, the server sends a series of small events over a long-lived HTTP connection. Each event contains a chunk of the generated text.

The Anthropic Messages API (which EzAI is fully compatible with) sends these event types:

message_start — contains the message ID and model info
content_block_start — signals a new text block is beginning
content_block_delta — contains the actual generated text, chunk by chunk
content_block_stop — signals the text block is complete
message_delta — final usage stats (output tokens, stop reason)
message_stop — the stream is done

Server-Sent Events flow: one request, many small responses streamed in real-time
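To make the event sequence concrete, here is a minimal sketch of parsing a raw SSE body by hand. The sample stream is abbreviated for illustration (real payloads carry more fields), and parse_sse and collect_text are hypothetical helper names, not part of any SDK. In practice the SDKs do this parsing for you.

```python
import json

# Abbreviated sample stream; real payloads contain additional fields.
SAMPLE_STREAM = """\
event: message_start
data: {"type": "message_start", "message": {"id": "msg_example", "model": "claude-sonnet-4-5"}}

event: content_block_start
data: {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": ", world"}}

event: content_block_stop
data: {"type": "content_block_stop", "index": 0}

event: message_delta
data: {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 4}}

event: message_stop
data: {"type": "message_stop"}
"""


def parse_sse(raw):
    """Yield (event_name, payload_dict) pairs from a raw SSE body."""
    event, data_lines = None, []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            # A blank line terminates the current event.
            yield event, json.loads("\n".join(data_lines))
            event, data_lines = None, []
    if data_lines:
        # Flush a final event that was not followed by a blank line.
        yield event, json.loads("\n".join(data_lines))


def collect_text(raw):
    """Concatenate the text carried by content_block_delta events."""
    return "".join(
        data["delta"]["text"]
        for event, data in parse_sse(raw)
        if event == "content_block_delta" and data["delta"]["type"] == "text_delta"
    )


if __name__ == "__main__":
    for event, _ in parse_sse(SAMPLE_STREAM):
        print(event)
    print(collect_text(SAMPLE_STREAM))
```

Only the content_block_delta events contribute text; the other events frame the message and carry metadata.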

Stream with curl — Quick Test

The fastest way to see streaming in action. Add "stream": true to your request body and watch the events arrive:

bash

curl --no-buffer https://ezaiapi.com/v1/messages \
  -H "x-api-key: sk-your-key" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 256,
    "stream": true,
    "messages": [{"role": "user", "content": "Explain streaming in 3 sentences."}]
  }'

You'll see a rapid sequence of event: and data: lines. The --no-buffer flag is critical — without it, curl buffers the output and you won't see real-time chunks.

Stream with Python (Anthropic SDK)

The official Anthropic Python SDK has first-class streaming support. Point it at ezaiapi.com and use the .stream() context manager:

python

import anthropic

client = anthropic.Anthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com"
)

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about APIs"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# Get final message with usage stats
message = stream.get_final_message()
print(f"\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")

The text_stream iterator yields each text chunk as it arrives. The flush=True ensures each chunk prints immediately instead of being buffered. After the stream completes, get_final_message() gives you the full message object with token counts.

Async streaming

For web servers and async apps, use the async client. Same API, just with await and async for:

python

import anthropic
import asyncio

client = anthropic.AsyncAnthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com"
)

async def main():
    async with client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Explain WebSockets vs SSE"}]
    ) as stream:
        async for text in stream.text_stream:
            print(text, end="", flush=True)

asyncio.run(main())

Stream with Node.js

The @anthropic-ai/sdk npm package supports streaming natively. Install it and point the base URL to EzAI:

javascript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: "sk-your-key",
  baseURL: "https://ezaiapi.com",
});

const stream = client.messages.stream({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Write a function to parse CSV" }],
});

stream.on("text", (text) => {
  process.stdout.write(text);
});

const message = await stream.finalMessage();
console.log(`\nTokens: ${message.usage.input_tokens} in, ${message.usage.output_tokens} out`);

The Node.js SDK emits text events for each chunk. You can also listen for message, contentBlock, and error events for finer control.

Streaming in a Web App (FastAPI + SSE)

Here's a minimal pattern for serving streamed AI responses to a browser. The backend proxies the stream from EzAI to the frontend using FastAPI's StreamingResponse:

python

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
client = anthropic.Anthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com"
)

@app.post("/chat")
async def chat(prompt: str):  # prompt is read from the query string
    def generate():
        with client.messages.stream(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        ) as stream:
            for text in stream.text_stream:
                # JSON-encode each chunk so newlines in the generated text
                # cannot break SSE framing; the client must JSON-decode it.
                yield f"data: {json.dumps(text)}\n\n"
            # Sentinel that tells the client the stream is finished
            yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )

On the frontend, consume the stream with the Fetch API's reader.read() loop. EventSource only issues GET requests, so it can't call this POST endpoint unless you expose the route as a GET. The [DONE] sentinel tells the client the stream is finished.
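One detail worth illustrating is framing: generated text often contains newlines, and a raw newline inside a data: line would split or cut off the event. The sketch below shows one way around that, assuming each chunk is JSON-encoded before it is sent. The helper names encode_frame and decode_frames are hypothetical, and the decoder is written in Python only to keep the round trip in one language; a browser client would do the same steps in JavaScript.

```python
import json

DONE = "[DONE]"  # sentinel payload marking the end of the stream


def encode_frame(text):
    """Wrap one text chunk as an SSE data frame.

    json.dumps escapes newlines, so the payload always fits on one line.
    """
    return f"data: {json.dumps(text)}\n\n"


def decode_frames(raw):
    """Yield text chunks from a raw SSE body until the [DONE] sentinel."""
    for frame in raw.split("\n\n"):
        if not frame.startswith("data: "):
            continue  # skip empty trailing pieces and unrelated lines
        payload = frame[len("data: "):]
        if payload == DONE:
            return
        yield json.loads(payload)


if __name__ == "__main__":
    chunks = ["def f():\n", "    return 1\n\n", "print(f())"]
    body = "".join(encode_frame(c) for c in chunks) + f"data: {DONE}\n\n"
    print("".join(decode_frames(body)))
```

Because the sentinel is sent unquoted while every text chunk is a quoted JSON string, a model that happens to output the literal text [DONE] is not mistaken for the end of the stream.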

Error Handling and Edge Cases

Streams can fail mid-response due to network issues, rate limits, or model timeouts. Always wrap your stream consumer in error handling:

Connection drops: the SDK automatically retries requests that fail before the stream starts. If the connection drops mid-stream, retry the request yourself with exponential backoff
Rate limits (429): the error fires before any content events, so you can retry the whole request
Partial responses: if the stream dies mid-content, you never receive the message_delta event that carries stop_reason. Treat any response that ends without a stop_reason as truncated
Timeouts: set a client-side timeout (30–120s) and cancel the stream if it stalls
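The retry rules above can be sketched as a small wrapper. This is an illustration under stated assumptions, not the SDK's own retry logic: stream_with_retry and open_stream are hypothetical names, and the built-in ConnectionError stands in for whatever exception your client raises (with the Anthropic SDK you would likely catch its own error classes instead). The key idea is to retry only while nothing has been shown to the user, and to surface a mid-stream failure rather than silently restart.

```python
import time


def stream_with_retry(open_stream, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Yield text chunks from open_stream(), retrying with exponential backoff.

    open_stream is a hypothetical zero-argument callable that returns an
    iterator of text chunks. A failure is retried only if no chunk has been
    emitted yet; a mid-stream failure is re-raised so the caller can mark
    the response as truncated.
    """
    for attempt in range(max_attempts):
        emitted = False
        try:
            for chunk in open_stream():
                emitted = True
                yield chunk
            return  # stream finished normally
        except ConnectionError:
            if emitted or attempt == max_attempts - 1:
                raise
            # Backoff schedule: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_stream():
        # Fails on the first call, succeeds on the second.
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise ConnectionError("connection refused")
        yield "Hello"
        yield ", world"

    print("".join(stream_with_retry(flaky_stream, base_delay=0.1)))
```

Passing sleep as a parameter keeps the wrapper easy to test without real delays.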

Time-to-first-token: streaming delivers visible output 10–50× faster than waiting for the full response

Performance Tips

A few things that make streaming work better in production:

Disable response buffering: make sure your reverse proxy (Nginx, Cloudflare) doesn't buffer SSE responses. In Nginx: proxy_buffering off;
Use HTTP/2: multiplexing lets you run multiple streams over one connection without HTTP-level head-of-line blocking
Batch small chunks: on the frontend, use requestAnimationFrame to batch DOM updates instead of updating on every single token
Track token usage: the message_start event reports the input token count and the message_delta event at the end of the stream reports the final output token count, so together they give you exact numbers for cost tracking
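As a rough sketch of the token-tracking tip, the function below reads usage from already-decoded stream events. It assumes the Anthropic event format, where message_start carries the input token count and message_delta carries a cumulative output token count. track_usage is a hypothetical helper, and the sample events are trimmed to the fields it reads.

```python
def track_usage(events):
    """Return (input_tokens, output_tokens) from decoded stream events."""
    input_tokens = 0
    output_tokens = 0
    for event in events:
        if event["type"] == "message_start":
            usage = event["message"]["usage"]
            input_tokens = usage["input_tokens"]
            output_tokens = usage.get("output_tokens", 0)
        elif event["type"] == "message_delta":
            # The count here is cumulative, so overwrite instead of adding.
            output_tokens = event["usage"]["output_tokens"]
    return input_tokens, output_tokens


if __name__ == "__main__":
    sample_events = [
        {"type": "message_start",
         "message": {"usage": {"input_tokens": 25, "output_tokens": 1}}},
        {"type": "content_block_delta",
         "delta": {"type": "text_delta", "text": "Hello"}},
        {"type": "message_delta",
         "delta": {"stop_reason": "end_turn"},
         "usage": {"output_tokens": 15}},
        {"type": "message_stop"},
    ]
    tokens_in, tokens_out = track_usage(sample_events)
    print(f"Tokens: {tokens_in} in, {tokens_out} out")
```

If you use the SDK helpers shown earlier, the final message's usage field already gives you these numbers, so a tracker like this mainly matters when you consume raw events.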

When Not to Stream

Streaming isn't always the right call. Skip it when:

You need JSON output — parsing partial JSON is painful. Use non-streaming and parse the complete response
Background jobs — if no human is watching, streaming adds complexity for zero benefit
Very short responses — for classification or yes/no answers, the overhead of SSE isn't worth it

For everything else — chatbots, code generation, content writing, explanations — always stream. Your users will thank you.

Streaming through EzAI works identically to the official Anthropic API. Just change your base_url to https://ezaiapi.com and you're set. Check out the full API docs for advanced features like extended thinking with streaming, or get started if you haven't set up your account yet.
