An AI agent without memory is just a stateless function. It answers your question, forgets your name, and starts every conversation from zero. That's fine for a one-shot demo and brutal for anything that needs to feel like a coworker, a tutor, or a long-running assistant.
This guide breaks down the three memory layers production agents actually use — short-term context, long-term vector recall, and episodic event logs — with working code you can paste into a project today. Examples use the EzAI API as a drop-in Anthropic-compatible endpoint.
The Three Memory Layers
Short-term lives in the prompt. Long-term lives in a vector index. Episodic lives in an append-only log.
Most agent bugs are memory bugs in disguise. The agent "forgot" because you didn't store it. The agent "lied" because the wrong fact made it back into the prompt. Naming each layer makes the design choices obvious:
- Short-term — the message array you pass to the model on every call. Limited by the context window, lost when the process restarts.
- Long-term — distilled facts and summaries stored in a vector database. Looked up by semantic similarity, injected into the prompt only when relevant.
- Episodic — a time-ordered append-only log of past turns and tool calls. Useful for replay, debugging, and "what did I tell you last Tuesday" queries.
Short-Term Memory: Sliding Window with Summarization
The naive approach is to append every turn to the message list and pray you don't blow the context window. A 200K window sounds infinite until your agent runs for an hour. The fix is a sliding window: keep the last N turns verbatim and replace older turns with a single summary message.
```python
import anthropic

client = anthropic.Anthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com",
)

def trim_history(history, keep=10):
    """Keep the last `keep` messages verbatim, summarize the rest."""
    if len(history) <= keep:
        return history
    older = history[:-keep]
    recent = history[-keep:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation in 3 bullet points, "
                       f"preserving names, numbers, and decisions:\n\n{transcript}",
        }],
    ).content[0].text
    return [{"role": "user", "content": f"[Earlier conversation summary]\n{summary}"}] + recent
```
Use a cheap model like claude-haiku-4-5 for the summarization pass — it's a meta-task, not customer-facing. Run the trim before every messages.create call once history exceeds your threshold. A good rule of thumb: keep 10 recent turns and let the summary absorb everything older.
Long-Term Memory: Vector Recall with pgvector
Summaries forget specifics. If a user told the agent their dog's name three weeks ago, you don't want that buried in a 400-token blob — you want it retrievable when they mention dogs. That's what long-term memory is for: store facts as embeddings, retrieve them by semantic similarity, inject the top results into the prompt.
On every turn: embed the user message, search the vector store, fold top-K results into the system prompt.
Schema is dead simple. Postgres + pgvector handles it without a separate service:
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE agent_memory (
    id         BIGSERIAL PRIMARY KEY,
    user_id    TEXT NOT NULL,
    content    TEXT NOT NULL,
    embedding  vector(1536),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON agent_memory USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX ON agent_memory (user_id, created_at DESC);
```
Write path: after each agent turn, call a small "memory extractor" that decides what's worth saving. Don't save everything — that's how you get noise. Save user preferences, decisions, names, and durable facts.
```python
import httpx
import psycopg

def embed(text):
    # Embeddings via EzAI's OpenAI-compatible endpoint
    r = httpx.post(
        "https://ezaiapi.com/v1/embeddings",
        headers={"Authorization": "Bearer sk-your-key"},
        json={"model": "text-embedding-3-small", "input": text},
    )
    r.raise_for_status()
    return r.json()["data"][0]["embedding"]

def remember(user_id, fact, conn):
    # pgvector accepts the "[1, 2, 3]" text format, so str(list) plus an
    # explicit ::vector cast works without registering extra type adapters.
    conn.execute(
        "INSERT INTO agent_memory (user_id, content, embedding) VALUES (%s, %s, %s::vector)",
        (user_id, fact, str(embed(fact))),
    )

def recall(user_id, query, conn, k=5):
    rows = conn.execute(
        """SELECT content FROM agent_memory
           WHERE user_id = %s
           ORDER BY embedding <=> %s::vector
           LIMIT %s""",
        (user_id, str(embed(query)), k),
    ).fetchall()
    return [r[0] for r in rows]
```
The <=> operator is pgvector's cosine distance. The IVFFlat index is approximate — it trades a sliver of recall for speed — which is what keeps queries under 10 ms even with millions of rows. Fold the recalled facts into the system prompt:
```python
memories = recall(user_id="u_42", query=user_msg, conn=conn)

system = "You are a helpful assistant.\n\n"
if memories:
    system += "What you remember about this user:\n" + "\n".join(f"- {m}" for m in memories)

resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=system,
    messages=trimmed_history,
)
```
Episodic Memory: Append-Only Event Log
Vector recall is great for "what do I know about this user" but useless for "what happened last Tuesday at 3pm". For that you need a flat, time-ordered log of every turn and tool call. This is also your debugging gold mine — when an agent does something weird in production, the episodic log is the first place you look.
```sql
CREATE TABLE agent_episodes (
    id         BIGSERIAL PRIMARY KEY,
    session_id TEXT NOT NULL,
    user_id    TEXT NOT NULL,
    role       TEXT NOT NULL,          -- user | assistant | tool
    content    JSONB NOT NULL,
    model      TEXT,
    tokens_in  INT,
    tokens_out INT,
    latency_ms INT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON agent_episodes (session_id, created_at);
CREATE INDEX ON agent_episodes (user_id, created_at DESC);
```
Write to it on every turn. Never update, never delete — episodic memory is immutable by design. When you need to replay a session, SELECT * FROM agent_episodes WHERE session_id = ? ORDER BY created_at reconstructs the entire conversation including tool inputs and outputs. This pairs nicely with OpenTelemetry tracing for full observability.
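The write and replay helpers are a single INSERT and a single SELECT. A sketch, assuming a psycopg connection; `log_episode` and `replay` are illustrative names:

```python
import json

def log_episode(conn, session_id, user_id, role, content,
                model=None, tokens_in=None, tokens_out=None, latency_ms=None):
    # Append-only by design: there are deliberately no update or delete helpers.
    conn.execute(
        """INSERT INTO agent_episodes
               (session_id, user_id, role, content, model,
                tokens_in, tokens_out, latency_ms)
           VALUES (%s, %s, %s, %s::jsonb, %s, %s, %s, %s)""",
        (session_id, user_id, role, json.dumps(content),
         model, tokens_in, tokens_out, latency_ms),
    )

def replay(conn, session_id):
    # Reconstruct the whole conversation in order, tool calls included.
    rows = conn.execute(
        """SELECT role, content FROM agent_episodes
           WHERE session_id = %s
           ORDER BY created_at, id""",
        (session_id,),
    ).fetchall()
    return [{"role": role, "content": content} for role, content in rows]
```

Ordering by `(created_at, id)` rather than `created_at` alone is a cheap guard against two writes landing in the same timestamp tick.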
Putting It Together
A turn in a memory-aware agent looks like this:
- Append user message to short-term history.
- Trim history with summarization if it's grown too long.
- Embed the user message and recall top-K long-term facts.
- Build the prompt: system + memories + trimmed history.
- Call the model.
- Write the user message and assistant reply to the episodic log.
- Optionally run a memory extractor on the turn and write new facts to long-term storage.
The whole pipeline is maybe 150 lines of glue code. The hard part is restraint: don't store everything in long-term memory, don't recall too aggressively, don't summarize too soon. Tune keep=N, top_k, and your extraction prompt against real conversations and watch the episodic log to see what the agent actually had in context.
What to Tune First
- Memory extractor prompt — too eager and you save noise; too strict and you miss what matters. Start with "save names, preferences, decisions, and durable facts only."
- Recall K — 3 to 7 is the sweet spot. More than that and the model starts treating recalls as more authoritative than the user's current message.
- Summary cadence — every 10 turns is conservative. If turns are short, push it to 20.
- Embedding model — text-embedding-3-small is fast, cheap, and good enough. Only upgrade to large if recall quality is provably the bottleneck.
Memory architecture is one of those things that looks like over-engineering until you ship without it and your users start complaining that the bot has amnesia. Start with short-term + episodic on day one, add long-term recall the first time someone says "I told you this already."
Next Steps
- Pair this with extended thinking for agents that reason about which memories matter.
- Use prompt caching on the system + memory block to cut costs when memory churn is low.
- See EzAI pricing for embedding and chat model rates.