Tutorial · Feb 27, 2026 · 8 min read

How to Stream AI Responses in Real-Time

EzAI Team

Waiting 10–30 seconds for an AI model to finish generating before displaying anything is a terrible user experience. Streaming fixes this by sending tokens to the client the moment they're generated, so your users see the response appear word by word, just like ChatGPT. This guide shows you exactly how to implement streaming with the EzAI API using Python, Node.js, and raw HTTP.

Why Streaming Matters

Without streaming, a typical Claude Sonnet request with a 500-token response takes 3–8 seconds. Your user stares at a spinner the entire time. With streaming enabled, the first token arrives in under 200ms and the rest follow in real-time. The total generation time is identical, but the perceived latency drops by 90%.

Streaming also lets you build features that aren't possible otherwise:

  • Progressive rendering — show markdown/code as it's generated
  • Early cancellation — user can stop generation mid-response
  • Live token counting — display token usage as it accumulates
  • Typing indicators — show "AI is typing..." with real content
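
Early cancellation in particular falls straight out of how streams work. Here's a toy Python sketch (no network involved, hypothetical tokens) of a consumer stopping a stream mid-response:

```python
closed = []  # records that cleanup ran; stands in for releasing the HTTP connection

def token_stream():
    """Pretend token stream; the finally block runs when the consumer cancels."""
    try:
        for token in ["AI", " responses", " stream", " in", " real", " time."]:
            yield token
    finally:
        closed.append(True)

gen = token_stream()
received = []
for token in gen:
    received.append(token)
    if len(received) == 3:  # user hit "stop" after three tokens
        gen.close()         # raises GeneratorExit inside the generator
        break

print("".join(received))  # AI responses stream
```

Exiting the SDK's with client.messages.stream(...) block early closes the underlying connection in the same way, so no further tokens are generated.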

How SSE Streaming Works

Most AI APIs use Server-Sent Events (SSE) for streaming. When you set "stream": true in your request, the server sends a series of small events over a long-lived HTTP connection instead of one big JSON response. Each event contains a chunk of the generated text.

The Anthropic Messages API (which EzAI is fully compatible with) sends these event types:

  • message_start — contains the message ID and model info
  • content_block_start — signals a new text block is beginning
  • content_block_delta — contains the actual generated text, chunk by chunk
  • content_block_stop — signals the text block is complete
  • message_delta — final usage stats (output tokens, stop reason)
  • message_stop — the stream is done
Server-Sent Events flow: one request, many small responses streamed in real-time
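
To make that sequence concrete, here's a minimal Python sketch that dispatches on these event types to accumulate the streamed text and usage stats. The payloads are abbreviated and illustrative; real events carry more fields (message IDs, model info, block indices):

```python
# Abbreviated, illustrative events in the order the API sends them.
events = [
    {"type": "message_start", "message": {"usage": {"input_tokens": 12}}},
    {"type": "content_block_start", "index": 0},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": ", world"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 4}},
    {"type": "message_stop"},
]

text_parts, usage = [], {}
for event in events:
    if event["type"] == "message_start":
        usage.update(event["message"]["usage"])   # input token count
    elif event["type"] == "content_block_delta":
        text_parts.append(event["delta"]["text"])  # the generated text, chunk by chunk
    elif event["type"] == "message_delta":
        usage.update(event["usage"])               # output token count

print("".join(text_parts))  # Hello, world
print(usage)                # {'input_tokens': 12, 'output_tokens': 4}
```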

Stream with curl — Quick Test

The fastest way to see streaming in action. Add "stream": true to your request body and watch the events arrive:

bash
curl --no-buffer https://ezaiapi.com/v1/messages \
  -H "x-api-key: sk-your-key" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 256,
    "stream": true,
    "messages": [{"role": "user", "content": "Explain streaming in 3 sentences."}]
  }'

You'll see a rapid sequence of event: and data: lines. The --no-buffer flag is critical — without it, curl buffers the output and you won't see real-time chunks.
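
To see the shape of those lines without making a request, here's a sketch that parses a hand-written, heavily abbreviated SSE transcript the way a client would: split on blank lines, then read each event's event: and data: fields.

```python
import json

# Hand-written, abbreviated transcript of the kind of output curl prints.
raw = """\
event: message_start
data: {"type": "message_start"}

event: content_block_delta
data: {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Streaming "}}

event: content_block_delta
data: {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "works!"}}

event: message_stop
data: {"type": "message_stop"}
"""

text = ""
for block in raw.split("\n\n"):  # a blank line separates SSE events
    event_type, payload = None, None
    for line in block.splitlines():
        if line.startswith("event: "):
            event_type = line[len("event: "):]
        elif line.startswith("data: "):
            payload = json.loads(line[len("data: "):])
    if event_type == "content_block_delta":
        text += payload["delta"]["text"]

print(text)  # Streaming works!
```

Real parsers also handle multi-line data: fields and comment lines; the SDKs below take care of that for you.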

Stream with Python (Anthropic SDK)

The official Anthropic Python SDK has first-class streaming support. Point it at ezaiapi.com and use the .stream() context manager:

python
import anthropic

client = anthropic.Anthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com"
)

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about APIs"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Get the final message (with usage stats) before the stream closes
    message = stream.get_final_message()

print(f"\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")

The text_stream iterator yields each text chunk as it arrives. The flush=True ensures each chunk prints immediately instead of being buffered. After the stream completes, get_final_message() gives you the full message object with token counts.

Async streaming

For web servers and async apps, use the async client. Same API, just with await and async for:

python
import anthropic
import asyncio

client = anthropic.AsyncAnthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com"
)

async def main():
    async with client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Explain WebSockets vs SSE"}]
    ) as stream:
        async for text in stream.text_stream:
            print(text, end="", flush=True)

asyncio.run(main())

Stream with Node.js

The @anthropic-ai/sdk npm package supports streaming natively. Install it and point the base URL to EzAI:

javascript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: "sk-your-key",
  baseURL: "https://ezaiapi.com",
});

const stream = client.messages.stream({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Write a function to parse CSV" }],
});

stream.on("text", (text) => {
  process.stdout.write(text);
});

const message = await stream.finalMessage();
console.log(`\nTokens: ${message.usage.input_tokens} in, ${message.usage.output_tokens} out`);

The Node.js SDK emits text events for each chunk. You can also listen for message, contentBlock, and error events for finer control.

Streaming in a Web App (FastAPI + SSE)

Here's a production-ready pattern for serving streamed AI responses to a browser. The backend proxies the stream from EzAI to the frontend using FastAPI's StreamingResponse:

python
from fastapi import Body, FastAPI
from fastapi.responses import StreamingResponse
import anthropic
import json

app = FastAPI()
client = anthropic.Anthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com"
)

@app.post("/chat")
async def chat(prompt: str = Body(..., embed=True)):
    def generate():
        with client.messages.stream(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        ) as stream:
            for text in stream.text_stream:
                # JSON-encode each chunk so embedded newlines can't break SSE framing
                yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )

On the frontend, consume the stream with the Fetch API's reader.read() loop (EventSource won't work here, since it only supports GET requests). The [DONE] sentinel tells the client the stream is finished.
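
For a quick test from Python (or any non-browser client), the sentinel handling reduces to this sketch; the data: lines are hypothetical stand-ins for the endpoint's output:

```python
# Hypothetical data: lines, as read one by one from the response body.
lines = [
    "data: The answer",
    "",
    "data:  is 42.",
    "",
    "data: [DONE]",
    "",
]

chunks = []
for line in lines:
    if not line.startswith("data: "):
        continue                  # skip blank separator / keep-alive lines
    payload = line[len("data: "):]
    if payload == "[DONE]":       # sentinel: the stream is finished
        break
    # If your server JSON-encodes chunks, json.loads(payload) here instead.
    chunks.append(payload)

full = "".join(chunks)
print(full)  # The answer is 42.
```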

Error Handling and Edge Cases

Streams can fail mid-response due to network issues, rate limits, or model timeouts. Always wrap your stream consumer in error handling:

  • Connection drops — the SDK retries the initial request automatically, but a stream that dies midway isn't resumed; re-send the whole request with exponential backoff
  • Rate limits (429) — the error fires before any content events, so you can retry the whole request
  • Partial responses — if the stream dies mid-content, check stop_reason on the final message. If it's null, the response was truncated
  • Timeouts — set a client-side timeout (30–120s) and cancel the stream if it stalls
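
The retry advice above can be wrapped in a small sketch. The retryable exception type, delay schedule, and attempt cap are illustrative choices, not SDK defaults:

```python
import random
import time

def stream_with_retry(run_stream, max_attempts=4, base_delay=0.5):
    """Run run_stream(); on connection errors, wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return run_stream()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Simulate a stream that drops twice before succeeding.
calls = {"n": 0}
def flaky_stream():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection dropped mid-stream")
    return "full response"

result = stream_with_retry(flaky_stream, base_delay=0.01)
print(result)  # full response
```

Remember that a retried stream regenerates from scratch, so discard (or reconcile) any partial text you've already rendered.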
Time-to-first-token: streaming delivers visible output 10–50× faster than waiting for the full response

Performance Tips

A few things that make streaming work better in production:

  1. Disable response buffering — make sure your reverse proxy (Nginx, Cloudflare) doesn't buffer SSE responses. In Nginx: proxy_buffering off;
  2. Use HTTP/2 — multiplexing lets you run multiple streams over one connection without head-of-line blocking
  3. Batch small chunks — on the frontend, use requestAnimationFrame to batch DOM updates instead of updating on every single token
  4. Track token usage — message_start carries the input token count and the final message_delta event carries the output token count, so you can log exact usage for cost tracking
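
For tip 1, here's a minimal Nginx location block sketch; the directives are standard Nginx, but the upstream name and timeout are illustrative:

```nginx
location /chat {
    proxy_pass http://app_backend;   # your FastAPI/Node upstream (illustrative name)
    proxy_http_version 1.1;          # required for keep-alive to the upstream
    proxy_set_header Connection "";  # clear the Connection header for keep-alive
    proxy_buffering off;             # pass SSE chunks through immediately
    proxy_cache off;                 # never cache event streams
    proxy_read_timeout 120s;         # allow long-lived streaming connections
}
```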

When Not to Stream

Streaming isn't always the right call. Skip it when:

  • You need JSON output — parsing partial JSON is painful. Use non-streaming and parse the complete response
  • Background jobs — if no human is watching, streaming adds complexity for zero benefit
  • Very short responses — for classification or yes/no answers, the overhead of SSE isn't worth it

For everything else — chatbots, code generation, content writing, explanations — always stream. Your users will thank you.


Streaming through EzAI works identically to the official Anthropic API. Just change your base_url to https://ezaiapi.com and you're set. Check out the full API docs for advanced features like extended thinking with streaming, or get started if you haven't set up your account yet.

