Use Cases · Feb 27, 2026 · 9 min read

Build a RAG Chatbot with Python and Claude API

EzAI Team

You want your chatbot to answer questions about your data — not just regurgitate training data from 2024. That's exactly what RAG (Retrieval Augmented Generation) does. Instead of fine-tuning, you build a RAG chatbot that fetches relevant documents at query time and feeds them to the LLM as context. It's cheaper, more accurate, and you can update your knowledge base without retraining anything.

RAG architecture — indexing pipeline (top) and query pipeline (bottom)

This guide walks you through building a working RAG pipeline in Python using Claude's API through EzAI. By the end, you'll have a chatbot that answers questions grounded in your own documents — with citations.

How RAG Works (30-Second Version)

The retrieval augmented generation pipeline has three stages:

  1. Index — Split your documents into chunks, generate embeddings, store them in a vector database
  2. Retrieve — When a user asks a question, find the most relevant chunks using semantic search
  3. Generate — Feed those chunks to Claude as context, get a grounded answer

That's it. No model training, no GPU clusters, no ML expertise required. Just text processing and API calls.
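To make the three stages concrete before touching any real libraries, here is a toy end-to-end sketch — a naive word-overlap score stands in for real embeddings and the "generate" stage just assembles the prompt, so this illustrates the data flow only, not production retrieval quality:

```python
def index(docs: list[str]) -> list[tuple[str, set[str]]]:
    # Stage 1: "index" each document as a bag of words (stand-in for embeddings)
    return [(doc, set(doc.lower().split())) for doc in docs]

def retrieve_stage(store, query: str, top_k: int = 2) -> list[str]:
    # Stage 2: rank documents by word overlap (stand-in for cosine similarity)
    q = set(query.lower().split())
    ranked = sorted(store, key=lambda item: len(item[1] & q), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def generate(context: list[str], question: str) -> str:
    # Stage 3: in the real pipeline this is the LLM call; here we just
    # show the grounded prompt the model would receive
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"

store = index(["Reset your password in Settings.", "Billing happens monthly."])
prompt = generate(retrieve_stage(store, "how do I reset my password?", top_k=1),
                  "How do I reset my password?")
```

The rest of this guide replaces each stand-in with the real thing: ChromaDB for stages 1-2 and Claude for stage 3.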

Prerequisites and Setup

You'll need Python 3.10+, an EzAI API key, and three packages:

bash
pip install anthropic numpy chromadb

We're using ChromaDB as our vector store because it runs locally with zero config — great for prototyping. For production, swap it out for Pinecone, Weaviate, or pgvector. The ChromaDB docs cover setup if you want to explore further.

Step 1: Chunk Your Documents

LLMs have context windows, and embeddings work best on focused text. Split your documents into chunks of ~500 tokens with some overlap so you don't cut sentences mid-thought:

python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by word count."""
    assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # final chunk reached; avoid a redundant overlapping tail
        start = end - overlap  # overlap keeps context between chunks

    return chunks

# Example: chunk a document
with open("docs/product-guide.md") as f:
    raw_text = f.read()

chunks = chunk_text(raw_text)
print(f"Split into {len(chunks)} chunks")

Step 2: Build the Vector Index with Embeddings

Now turn those chunks into vectors and store them. ChromaDB handles embedding generation internally using its default model, which keeps things simple. For production, you might want OpenAI's text-embedding-3-small or Cohere's embeddings via EzAI's OpenAI-compatible endpoint.

python
import chromadb

# Create a persistent local vector store
client = chromadb.PersistentClient(path="./vectordb")
collection = client.get_or_create_collection("knowledge_base")

# Add chunks with metadata for citation
collection.add(
    documents=chunks,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    metadatas=[{"source": "product-guide.md", "chunk_index": i} for i in range(len(chunks))]
)

print(f"Indexed {collection.count()} chunks")

You only run indexing once (or when documents change). The vector store persists to disk, so restarts are instant.
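One simple way to make re-indexing cheap when documents do change is to derive each chunk's ID from its content hash, so unchanged chunks can be skipped. This is a sketch, not ChromaDB's built-in behavior — the `ids_to_add` helper is a hypothetical name:

```python
import hashlib

def chunk_id(text: str) -> str:
    # Deterministic ID derived from the chunk's content: the same chunk
    # always hashes to the same ID, so unchanged content is easy to detect
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def ids_to_add(chunks: list[str], existing_ids: set[str]) -> list[tuple[str, str]]:
    # Only chunks whose content hash isn't already in the store need indexing
    return [(chunk_id(c), c) for c in chunks if chunk_id(c) not in existing_ids]

# With ChromaDB, you'd fetch existing IDs (e.g. from collection.get()) and
# pass the surviving (id, chunk) pairs as ids=/documents= to collection.add()
```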

Step 3: Retrieve Relevant Context

When a user asks a question, query the vector store to find the most relevant chunks. ChromaDB uses cosine similarity under the hood:

python
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Find the most relevant chunks for a query."""
    results = collection.query(
        query_texts=[query],
        n_results=top_k
    )

    # Pair documents with their metadata
    context_chunks = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        context_chunks.append({"text": doc, "source": meta["source"]})

    return context_chunks

# Test retrieval
results = retrieve("How do I reset my password?")
print(f"Found {len(results)} relevant chunks")

Step 4: Generate Answers with Claude

Now the core of the RAG chatbot — take the retrieved context and ask Claude to answer based on it. The system prompt is critical here. You want the model to cite its sources and refuse to hallucinate when the context doesn't contain the answer:

python
import anthropic

claude = anthropic.Anthropic(
    api_key="sk-your-ezai-key",
    base_url="https://ezaiapi.com"
)

SYSTEM_PROMPT = """You are a helpful assistant that answers questions based on
the provided context documents. Rules:
1. Only use information from the provided context
2. If the context doesn't contain the answer, say so honestly
3. Cite which source document your answer comes from
4. Be concise and direct"""

def ask(question: str) -> str:
    # 1. Retrieve relevant context
    chunks = retrieve(question, top_k=5)

    # 2. Format context for the prompt
    context = "\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}"
        for c in chunks
    )

    # 3. Ask Claude with context
    response = claude.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )

    return response.content[0].text

# Try it
answer = ask("How do I reset my password?")
print(answer)

That's a working RAG chatbot in under 80 lines of Python. Ask it a question, it searches your docs, and Claude generates an answer grounded in your actual content.

Making It Production-Ready

The code above works, but production needs a few more things. Here are the most impactful improvements:

Add Conversation Memory

RAG chatbots need to remember earlier messages in the conversation. Pass the full chat history to Claude, but only retrieve context for the latest question:

python
def chat(question: str, history: list) -> str:
    chunks = retrieve(question, top_k=5)
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)

    messages = history + [{
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: {question}"
    }]

    response = claude.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=messages
    )

    answer = response.content[0].text
    # Append to history for next turn
    history.append({"role": "user", "content": question})
    history.append({"role": "assistant", "content": answer})
    return answer
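One caveat: the history list above grows without bound, so long conversations will eventually blow past the context window. A minimal trimming sketch, assuming the same list-of-dicts history format:

```python
def trim_history(history: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep only the most recent max_turns user/assistant pairs."""
    max_messages = max_turns * 2  # each turn is one user + one assistant message
    if len(history) <= max_messages:
        return history
    trimmed = history[-max_messages:]
    # Claude expects the conversation to start with a user message
    if trimmed and trimmed[0]["role"] == "assistant":
        trimmed = trimmed[1:]
    return trimmed
```

Call it on `history` before building `messages` in `chat()`. For longer memories, summarizing old turns into a single message is a common next step.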

Use Prompt Caching for Cost Savings

Your system prompt and frequently-retrieved chunks get sent over and over. Use prompt caching to cut input token costs by up to 90% on repeat calls — a huge win for chatbots with many concurrent users.
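With the Anthropic SDK, caching is opt-in: you mark the system prompt as cacheable via a `cache_control` block. A sketch (check that your provider passes prompt caching through before relying on the savings):

```python
def cached_system(prompt: str) -> list[dict]:
    # Mark the system prompt as cacheable; repeat calls within the cache
    # window read it back at a fraction of the normal input-token price
    return [{
        "type": "text",
        "text": prompt,
        "cache_control": {"type": "ephemeral"},
    }]

# Usage: claude.messages.create(..., system=cached_system(SYSTEM_PROMPT), ...)
```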

Pick the Right Model Per Query

Not every question needs Sonnet. Simple factual lookups ("What's the refund policy?") run great on Haiku at a fraction of the cost. Complex queries ("Compare our enterprise and startup plans and recommend one for a 50-person team") deserve Sonnet or Opus. See our cost optimization guide for routing strategies.
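A crude heuristic router is enough to start — this sketch routes on query length and a few comparison keywords; the markers and threshold are assumptions you'd tune against your own traffic:

```python
def pick_model(question: str) -> str:
    # Route longer or comparative questions to Sonnet, simple lookups to Haiku
    complex_markers = ("compare", "recommend", "explain why", "trade-off")
    q = question.lower()
    if len(question.split()) > 25 or any(m in q for m in complex_markers):
        return "claude-sonnet-4-5"
    return "claude-haiku-4-5"
```

Then pass `model=pick_model(question)` in the `messages.create` call instead of a hardcoded model name.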

Common Pitfalls to Avoid

  • Chunks too large — Big chunks dilute relevance. Keep them under 500 words. If your answers cite irrelevant paragraphs, your chunks are too big.
  • No overlap — Without overlap, you'll split sentences and lose meaning at chunk boundaries. 50-100 word overlap fixes this.
  • Stuffing too many chunks — Retrieving 20 chunks means Claude has to read a wall of text. 3-5 chunks is the sweet spot for most use cases.
  • No "I don't know" — Without explicit instructions to refuse, Claude will try to be helpful and fabricate answers. Your system prompt must tell it to say "I don't have that information" when context is insufficient.
  • Ignoring metadata — Store source filename, page number, last-updated date with each chunk. This enables citations and lets you filter by freshness.
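The metadata point pays off quickly — for example, filtering retrieved chunks by freshness before they reach the prompt. A sketch over the chunk dicts from `retrieve()`, assuming a `last_updated` ISO-date field was stored in each chunk's metadata (it wasn't in the minimal example above):

```python
from datetime import date

def fresh_only(chunks: list[dict], cutoff: str) -> list[dict]:
    """Drop chunks whose source was last updated before cutoff (ISO date)."""
    limit = date.fromisoformat(cutoff)
    return [c for c in chunks
            if date.fromisoformat(c.get("last_updated", "1970-01-01")) >= limit]

# ChromaDB can also filter at query time via metadata, e.g.
# collection.query(query_texts=[q], n_results=5,
#                  where={"source": "product-guide.md"})
```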

Where to Go from Here

You've got a working RAG chatbot. Here's how to level up:

  • Add streaming — Use Claude's streaming API for real-time token-by-token output. Way better UX for chat interfaces.
  • Hybrid search — Combine vector search with keyword search (BM25) for better retrieval. Libraries like rank-bm25, or vector stores with built-in hybrid search such as Weaviate, make this straightforward.
  • Evaluation — Track retrieval precision and answer quality. If users ask follow-up questions, your first answer probably wasn't good enough.
  • Multiple doc types — Parse PDFs, HTML, and Markdown. Libraries like unstructured handle this well.
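For the streaming bullet above, the Anthropic SDK exposes a streaming context manager. A sketch — the pure `collect_stream` helper is a hypothetical name, and the commented usage assumes the `claude` client and `SYSTEM_PROMPT` from earlier:

```python
from typing import Iterable

def collect_stream(deltas: Iterable[str]) -> str:
    """Print text deltas as they arrive and return the full reply."""
    parts = []
    for text in deltas:
        print(text, end="", flush=True)  # token-by-token output
        parts.append(text)
    print()
    return "".join(parts)

# With the Anthropic SDK (reusing the claude client from earlier):
#
# with claude.messages.stream(
#     model="claude-sonnet-4-5",
#     max_tokens=1024,
#     system=SYSTEM_PROMPT,
#     messages=[{"role": "user", "content": prompt}],
# ) as stream:
#     answer = collect_stream(stream.text_stream)
```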

If you're new to EzAI, grab a free API key from the getting started guide — you get 15 free credits and access to free models to prototype your RAG pipeline before committing any budget. All the code in this post works with EzAI's Anthropic-compatible endpoint with zero modifications.
