Tutorial · Apr 21, 2026 · 9 min read

AI Agent Memory: Short-Term, Long-Term, and Episodic Patterns

EzAI Team

An AI agent without memory is just a stateless function. It answers your question, forgets your name, and starts every conversation from zero. That's fine for a one-shot demo but brutal for anything that needs to feel like a coworker, a tutor, or a long-running assistant.

This guide breaks down the three memory layers production agents actually use — short-term context, long-term vector recall, and episodic event logs — with working code you can paste into a project today. Examples use the EzAI API as a drop-in Anthropic-compatible endpoint.

The Three Memory Layers

[Figure: Three memory layers in an AI agent]

Short-term lives in the prompt. Long-term lives in a vector index. Episodic lives in an append-only log.

Most agent bugs are memory bugs in disguise. The agent "forgot" because you didn't store it. The agent "lied" because the wrong fact made it back into the prompt. Naming each layer makes the design choices obvious:

  • Short-term — the message array you pass to the model on every call. Limited by the context window, lost when the process restarts.
  • Long-term — distilled facts and summaries stored in a vector database. Looked up by semantic similarity, injected into the prompt only when relevant.
  • Episodic — a time-ordered append-only log of past turns and tool calls. Useful for replay, debugging, and "what did I tell you last Tuesday" queries.
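Before diving into each layer, the shape of the whole thing can be sketched as a single container. The class and field names here are illustrative, not a standard API:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentMemory:
    """Illustrative container for the three memory layers."""
    short_term: list = field(default_factory=list)  # message array, lost on restart
    long_term: Any = None                           # handle to a vector store (e.g. pgvector)
    episodic: list = field(default_factory=list)    # append-only event log

    def log_episode(self, role, content):
        # Episodic entries are only ever appended, never mutated or deleted.
        self.episodic.append({"role": role, "content": content})

mem = AgentMemory()
mem.short_term.append({"role": "user", "content": "My dog is named Rex."})
mem.log_episode("user", "My dog is named Rex.")
```

The rest of this post fills in each field: a trimming policy for `short_term`, a vector table behind `long_term`, and a durable table behind `episodic`.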

Short-Term Memory: Sliding Window with Summarization

The naive approach is to append every turn to the message list and pray you don't blow the context window. A 200K window sounds infinite until your agent runs for an hour. The fix is a sliding window: keep the last N turns verbatim and replace older turns with a single summary message.

python
import anthropic

client = anthropic.Anthropic(
    api_key="sk-your-key",
    base_url="https://ezaiapi.com",
)

def trim_history(history, keep=10):
    # Keep last `keep` messages verbatim, summarize the rest.
    if len(history) <= keep:
        return history

    older = history[:-keep]
    recent = history[-keep:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)

    summary = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=400,
        messages=[{"role": "user",
                   "content": f"Summarize this conversation in 3 bullet points, preserving names, numbers, and decisions:\n\n{transcript}"}],
    ).content[0].text

    return [{"role": "user", "content": f"[Earlier conversation summary]\n{summary}"}] + recent

Use a cheap model like claude-haiku-4-5 for the summarization pass — it's a meta-task, not customer-facing. Run the trim before every messages.create call once history exceeds your threshold. A good rule of thumb: keep 10 recent turns and let the summary absorb everything older.

Long-Term Memory: Vector Recall with pgvector

Summaries forget specifics. If a user told the agent their dog's name three weeks ago, you don't want that buried in a 400-token blob — you want it retrievable when they mention dogs. That's what long-term memory is for: store facts as embeddings, retrieve them by semantic similarity, inject the top results into the prompt.

[Figure: Agent memory retrieval flow]

On every turn: embed the user message, search the vector store, fold top-K results into the system prompt.

Schema is dead simple. Postgres + pgvector handles it without a separate service:

sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE agent_memory (
    id          BIGSERIAL PRIMARY KEY,
    user_id     TEXT NOT NULL,
    content     TEXT NOT NULL,
    embedding   vector(1536),
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON agent_memory USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX ON agent_memory (user_id, created_at DESC);

Write path: after each agent turn, call a small "memory extractor" that decides what's worth saving. Don't save everything — that's how you get noise. Save user preferences, decisions, names, and durable facts.

python
import psycopg, httpx

def embed(text):
    # Embeddings via EzAI's OpenAI-compatible endpoint
    r = httpx.post(
        "https://ezaiapi.com/v1/embeddings",
        headers={"Authorization": "Bearer sk-your-key"},
        json={"model": "text-embedding-3-small", "input": text},
    ).json()
    return r["data"][0]["embedding"]

def remember(user_id, fact, conn):
    vec = embed(fact)
    # Pass the vector in pgvector's '[x,y,...]' text form and cast explicitly;
    # alternatively, register pgvector's psycopg adapter to pass lists directly.
    conn.execute(
        "INSERT INTO agent_memory (user_id, content, embedding) VALUES (%s, %s, %s::vector)",
        (user_id, fact, str(vec)),
    )
    conn.commit()

def recall(user_id, query, conn, k=5):
    vec = embed(query)
    rows = conn.execute(
        """SELECT content FROM agent_memory
           WHERE user_id = %s
           ORDER BY embedding <=> %s::vector
           LIMIT %s""",
        (user_id, str(vec), k),
    ).fetchall()
    return [r[0] for r in rows]

The <=> operator is pgvector's cosine distance. The IVFFlat index makes the search approximate nearest neighbor, which keeps queries fast even with millions of rows, at the cost of slightly imperfect recall. Fold the recalled facts into the system prompt:

python
memories = recall(user_id="u_42", query=user_msg, conn=conn)

system = "You are a helpful assistant.\n\n"
if memories:
    system += "What you remember about this user:\n" + "\n".join(f"- {m}" for m in memories)

resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=system,
    messages=trimmed_history,
)

Episodic Memory: Append-Only Event Log

Vector recall is great for "what do I know about this user" but useless for "what happened last Tuesday at 3pm". For that you need a flat, time-ordered log of every turn and tool call. This is also your debugging gold mine — when an agent does something weird in production, the episodic log is the first place you look.

sql
CREATE TABLE agent_episodes (
    id          BIGSERIAL PRIMARY KEY,
    session_id  TEXT NOT NULL,
    user_id     TEXT NOT NULL,
    role        TEXT NOT NULL,   -- user | assistant | tool
    content     JSONB NOT NULL,
    model       TEXT,
    tokens_in   INT,
    tokens_out  INT,
    latency_ms  INT,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON agent_episodes (session_id, created_at);
CREATE INDEX ON agent_episodes (user_id, created_at DESC);

Write to it on every turn. Never update, never delete — episodic memory is immutable by design. When you need to replay a session, SELECT * FROM agent_episodes WHERE session_id = ? ORDER BY created_at reconstructs the entire conversation including tool inputs and outputs. This pairs nicely with OpenTelemetry tracing for full observability.
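The write path is one small helper. A sketch assuming a psycopg connection and the `agent_episodes` schema above; `log_episode` is an illustrative name:

```python
import json

def log_episode(conn, session_id, user_id, role, content,
                model=None, tokens_in=None, tokens_out=None, latency_ms=None):
    # Append-only write into agent_episodes; content goes in as JSONB so
    # tool inputs/outputs and structured replies survive intact.
    conn.execute(
        """INSERT INTO agent_episodes
           (session_id, user_id, role, content, model, tokens_in, tokens_out, latency_ms)
           VALUES (%s, %s, %s, %s, %s, %s, %s, %s)""",
        (session_id, user_id, role, json.dumps(content),
         model, tokens_in, tokens_out, latency_ms),
    )
    conn.commit()
```

The token and latency columns are nullable on purpose: fill them when the model response exposes usage data, and leave them NULL for tool events.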

Putting It Together

A turn in a memory-aware agent looks like this:

  1. Append user message to short-term history.
  2. Trim history with summarization if it's grown too long.
  3. Embed the user message and recall top-K long-term facts.
  4. Build the prompt: system + memories + trimmed history.
  5. Call the model.
  6. Write the user message and assistant reply to the episodic log.
  7. Optionally run a memory extractor on the turn and write new facts to long-term storage.

The whole pipeline is maybe 150 lines of glue code. The hard part is restraint: don't store everything in long-term memory, don't recall too aggressively, don't summarize too soon. Tune keep=N, top_k, and your extraction prompt against real conversations and watch the episodic log to see what the agent actually had in context.

What to Tune First

  • Memory extractor prompt — too eager and you save noise; too strict and you miss what matters. Start with "save names, preferences, decisions, and durable facts only."
  • Recall K — 3 to 7 is the sweet spot. More than that and the model starts treating recalls as more authoritative than the user's current message.
  • Summary cadence — every 10 turns is conservative. If turns are short, push it to 20.
  • Embedding model — text-embedding-3-small is fast, cheap, and good enough. Only upgrade to text-embedding-3-large if recall quality is provably the bottleneck.

Memory architecture is one of those things that looks like over-engineering until you ship without it and your users start complaining that the bot has amnesia. Start with short-term + episodic on day one, add long-term recall the first time someone says "I told you this already."

Next Steps

  • Pair this with extended thinking for agents that reason about which memories matter.
  • Use prompt caching on the system + memory block to cut costs when memory churn is low.
  • See EzAI pricing for embedding and chat model rates.
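For the prompt-caching idea, here is a sketch that marks the memory block as cacheable with Anthropic-style `cache_control`. Whether a given Anthropic-compatible endpoint honors caching is an assumption to verify:

```python
def call_with_cached_memory(client, memories, trimmed_history):
    # Split the system prompt into blocks so the stable memory prefix can be
    # cached across turns while the message history keeps changing.
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {"type": "text", "text": "You are a helpful assistant."},
            {"type": "text",
             "text": "What you remember about this user:\n" + "\n".join(f"- {m}" for m in memories),
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=trimmed_history,
    )
```

Note the cache only pays off when the memory block is stable between calls, so avoid re-running recall with a different ordering every turn.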
