You told your AI assistant, last Tuesday, that you never want to hear the phrase "let's dive in" in a piece of marketing copy. On Thursday it produces a paragraph that opens with "Let's dive in." You sigh. You correct it. On Monday, same thing.
This is not a bug. It is the architecture.
The assistant you spoke to on Tuesday does not exist. The assistant you are speaking to on Monday is a different one, and will be different again tomorrow. The only thing that "continues" between those conversations is whatever you — or the tooling around the model — put into the context window at the start of each session. The feeling of continuity that an LLM chat interface produces is a fiction, held together by careful engineering of what gets loaded into the prompt every time the model is invoked.
This sounds bleak when you first internalise it. It is, in practice, liberating. Once you accept that the model has no native memory, you can stop trying to fake it and start engineering it. Memory is a design decision, not a feature, and if you are building with Claude Code for real work there are three distinct mechanisms that implement it — each with a different lifespan, a different cost, and a different correct use. Most of the frustration I hear from people doing multi-agent or long-running Claude work is the result of using the wrong one.
What continuity actually is
The model is stateless. On every turn of a conversation, it is re-run from scratch on the current context window, which the harness — Claude Code, the Anthropic API, whatever you are using — has loaded immediately before inference. There is no persistent "assistant" object sitting between turns. The "session" is a convention maintained by the harness: keep the context window around, hand it back in, and the illusion of continuity emerges.
This is different from how we talk about it. We say "Claude remembers what we discussed." Technically: the earlier discussion is still in the context window, and the model reads it from scratch every turn. A useful fiction, but a fiction.
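The mechanics fit in a few lines. This is a toy sketch, not a real inference call: `model` here is a stand-in pure function, and the message format is only loosely modelled on a real chat API. The point is that all of the state lives in the `history` list the harness keeps, and the model re-reads it from scratch every turn.

```python
# A toy harness. The "model" is a pure function of the context it is
# handed; it keeps no state between calls. All continuity lives in the
# `history` list that the harness re-sends on every turn.

def model(context: list[dict]) -> str:
    """Stand-in for an inference call: answers purely from `context`."""
    # Report how much of the conversation this invocation can "see".
    user_turns = [m["content"] for m in context if m["role"] == "user"]
    return f"I can see {len(user_turns)} user message(s); latest: {user_turns[-1]!r}"

def run_session(turns: list[str]) -> list[str]:
    history: list[dict] = []   # the entire "session" is this list
    replies = []
    for text in turns:
        history.append({"role": "user", "content": text})
        reply = model(history)  # full history re-read from scratch
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

replies = run_session(["hello", "what did I just say?"])
```

The second reply is informed by the first turn only because the harness re-sent it, not because anything "remembered" it. Delete the `history` list and the assistant from one turn ago is gone.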
Once you internalise this, a lot of otherwise-confusing behaviour becomes obvious. Why the assistant "forgets" things once the context is compacted. Why injecting a stale summary produces strange artefacts. Why the model sometimes confidently contradicts something it told you five messages ago — from its point of view, nothing ever told anyone anything; there is only the current stretch of text, and its job is to produce the next plausible token.
The three layers
Claude Code gives you three built-in mechanisms for getting state into the context window. They are not interchangeable.
1. The context window itself.
The text that is in the current session. Tokens. Finite — 200,000 or 1,000,000 depending on what you are running, but finite either way. Ephemeral: gone when the session ends, summarised or truncated as it fills.
Good for: work in progress. Intermediate calculations. The shape of the task you are currently doing. The back-and-forth of a conversation that you intend to wrap up today.
Bad for: anything you will want next week. Anything you will want in a different project. Anything that needs to survive the orchestrator deciding to compact.
2. CLAUDE.md — project instructions.
A markdown file at the project root, and optionally more at nested paths. Loaded into every session at start, without negotiation. This is where the agent's identity, scope, tooling, conventions, and hard rules live.
Good for: who the agent is. How this specific project works. Permanent rules — "never edit file X," "we use PHP 8.2 on this stack," "the live site lives at Y." Things that will still be true the next time anyone opens this project.
Bad for: anything that changes between sessions. Anything that needs to be updated as work progresses. Transient state that will be outdated in a fortnight.
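To make the register concrete, here is an invented example of the kind of content that belongs in this layer. Every project detail below is hypothetical; the shape is what matters: identity, hard rules, and pointers to where real data lives.

```markdown
# CLAUDE.md (acme-storefront, hypothetical)

You are the maintenance agent for the Acme storefront (PHP 8.2, Laravel).

## Hard rules
- Never edit `config/payments.php` directly; changes go through review.
- The live site is https://shop.example.com; never run migrations against it.

## Where things live
- Bugs are tracked in the INGEST project on Linear.
- Deployment notes: `docs/deploy.md`.
```

Everything in it will still be true the next time anyone opens the project. Nothing in it is session state.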
3. The auto-memory system — a directory of small markdown files.
Each file is a single discrete memory: a user preference, the current status of a project, a piece of feedback received, a reference to an external system. A MEMORY.md index lists each one with a short description. On session start, the index is loaded; individual memory files are pulled into context as and when they become relevant.
Good for: things the agent has learned over time and will need again. User preferences that have already been stated once ("don't mock the database in tests — we got burned by a mocked test passing before a failing prod migration"). Project state that outlives a session but changes ("the contact form is live as of April 6th; reCAPTCHA site key is…"). Feedback on approach that emerged from a correction.
Bad for: things so obviously true they belong in CLAUDE.md. Things so narrow they are only useful within the current conversation. Large reference material — that is what your filesystem is for.
(And one more, for honesty.) Beyond the three Claude Code mechanisms, there is the filesystem, the database, git history, and any external systems the agent can query on demand. I do not think of these as "memory" in the same sense — they are query-on-demand data stores, and their lifespan is exactly as persistent as whatever is storing them. But a good agent knows which ones to consult and when, and a good CLAUDE.md points at them explicitly.
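The index-then-load pattern from the third layer is simple enough to sketch. The `MEMORY.md` format below — one "filename: description" line per memory — is an assumption for illustration; the real auto-memory layout may differ. The shape of the pattern is the point: only the index is read at session start, and individual files are pulled in when a turn touches their topic.

```python
# Sketch of index-then-load memory retrieval. Assumes a hypothetical
# MEMORY.md format of one "- filename.md: description" line per memory;
# the real Claude Code layout may differ.
from pathlib import Path

def parse_index(index_text: str) -> dict[str, str]:
    """Map each memory filename to its one-line description."""
    entries = {}
    for line in index_text.splitlines():
        line = line.strip()
        if line.startswith("- ") and ":" in line:
            name, desc = line[2:].split(":", 1)
            entries[name.strip()] = desc.strip()
    return entries

def relevant_memories(entries: dict[str, str], query: str) -> list[str]:
    """Pick memory files whose description shares a word with the query."""
    words = set(query.lower().split())
    return [name for name, desc in entries.items()
            if words & set(desc.lower().split())]

index = """\
- preferences.md: marketing copy style, banned phrases
- contact-form.md: contact form status and reCAPTCHA keys
"""
entries = parse_index(index)
# Only the index is in context at session start; a matching file would
# then be read (e.g. Path(name).read_text()) and added to the prompt.
hits = relevant_memories(entries, "draft some marketing copy")
```

A real implementation would let the model judge relevance rather than match words, but the economics are identical: a dozen index lines in every session, full memories only when needed.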
Which belongs where
Two rules of thumb get you most of the way.
Rule one. If the same piece of information would need to appear in the system prompt every single time the project is opened — the tech stack, the agent's identity, the non-negotiable rules — it belongs in CLAUDE.md. It is not memory; it is configuration. Treat it as such.
Rule two. If the information needs to survive the end of a session, but might be different in three months' time, it belongs in a memory file. Memory is for facts that are currently true. Configuration is for facts that are permanently true. Knowing the difference saves you a lot of grief.
A third, more subtle rule. If the information is big, structured, or queryable — a dataset, a codebase, an issue tracker, a comms log — do not stuff it into memory at all. Leave it in its natural home and point the agent at it. A memory file that says "bugs are tracked in the INGEST project on Linear" is worth a thousand memory files that attempt to summarise the current state of Linear.
Why not one big vector database?
The question always comes up. Why not take all your project state, chunk it, embed it, and let the model retrieve against cosine similarity?
I have tried this. I have reviewed client implementations that did it. The honest summary is that for the vast majority of Claude Code use cases, vector retrieval is wildly more machinery than the problem requires, and it performs worse than the unglamorous file-based alternative.
The issues are consistent:
- Retrieval is coarse. Cosine similarity over chunks returns passages that are topically adjacent to the query but often miss the specific point. You get the neighbourhood of the answer instead of the answer.
- The unit of retrieval is not the unit of use. Memory, in practice, is used as whole documents — a preference, a rule, a status note. Vector DBs hand you fragments sized for embedding, which then have to be reassembled into something coherent for the prompt.
- It degrades silently. More data means lower precision. The system gets subtly worse as the corpus grows, and there is no obvious alarm.
- You cannot read it. The state is opaque. You cannot grep it, edit a single memory without re-indexing, or inspect what the agent actually sees at a given moment.
Flat markdown files, indexed by a MEMORY.md pointer file, give you perfect retrieval (you either wrote the memory or you did not), human-readable and editable state, and composability with every tool you already own. I can grep my agents' memories. I can edit them with a text editor. I can diff them in git. The performance ceiling is, in my experience, materially higher than that of the vector-DB systems I have been asked to evaluate in production.
There are genuine cases for vector retrieval — unstructured corpora too large to enumerate, unknown query spaces, latency constraints that preclude filesystem reads. They exist. They are rarer than the architecture diagrams suggest. For the common case of "an agent that has learned things about the user and the project over time," a directory of markdown files and an index is the right answer, and it is not close.
Engineering the illusion
The framing I have ended up with is mechanical, and I find it clarifying. The context window is RAM. The filesystem and database are disk. CLAUDE.md is the kernel config that loads on every boot. The memory files are the rc scripts that run at session start. Each layer has a correct use, and the failure modes I see most often all come from confusing them.
Continuity is not something the model provides. It is a composition of cheap, well-understood, file-based primitives that the harness arranges around the model. Once you stop trying to make the model remember, and start deliberately designing the context that gets handed to it at each turn, the whole system becomes much easier to reason about — and, not incidentally, much more debuggable. When an agent "forgets" something, I can tell you exactly which file was or wasn't loaded, and either fix the index or update the memory.
None of this is exotic infrastructure. My entire memory system, across twenty-five agents, is a directory of markdown files per project and a small hook that surfaces the right index on session start. That is the whole machine. It has been running for months without surprise.
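For completeness, here is what a hook of that sort can look like. This is a sketch of the idea, not my actual script, and the paths and wiring are assumptions: a session-start hook runs a command whose output gets folded into the opening context, so all the script has to do is find the project's index and print it.

```python
# Minimal sketch of a session-start hook: locate the project's
# MEMORY.md and print it to stdout, so whatever invokes this script
# can prepend the index to the opening context. Paths are illustrative.
import sys
from pathlib import Path

def surface_index(project_root: str) -> str:
    index = Path(project_root) / "memory" / "MEMORY.md"
    if not index.is_file():
        return ""  # no memories yet: surface nothing, fail quietly
    return index.read_text()

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    sys.stdout.write(surface_index(root))
```

That is the entire moving part: one file read per session, and the rest is just files sitting on disk.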
The model does not remember you. Your files do. The only question is whether you have arranged them properly.