Memory Is Not a Database: Five Models from Eight Coding Agents
Coding agents all face the same problem: sessions are ephemeral but work needs to compound. There are five distinct jobs that get lumped under 'memory' — and the systems that conflate them reliably hit the same failure modes.
Coding agents all face the same problem: sessions are ephemeral but work needs to compound. There are five distinct jobs that get lumped under 'memory' — and the systems that conflate them reliably hit the same failure modes.
The previous post in this series mapped the coding-agent field by two structural axes — where the agent runs and how many agents coordinate. This one goes one level deeper: once you have an agent running, how does it remember?
This is a harder question than it sounds. Every team building a coding agent eventually discovers that “memory” is not one problem. It is five distinct jobs, and the systems that conflate them reliably hit the same failure modes.
I have first-hand research notes on 80+ coding agents. Here is what the memory patterns actually look like across the serious ones.
The problem that makes everyone invent their own solution
A coding agent works in sessions. Each session starts from some context state, does work, and ends. Anything that isn’t explicitly saved is gone. For interactive assistants — Cursor, VS Code Copilot, Cline in “chat” mode — this is tolerable because a human is present to fill in the gaps. For autonomous agents that run on a schedule without a human in the loop, it is a fundamental constraint.
The naive fix is: dump everything into a database. Capture all session output. Inject it back at startup. This is what a lot of the most-starred “memory for AI” projects do, and it works just well enough to ship a demo. The problems show up at scale:
- Unbounded input: if every tool call gets captured, context budgets explode. Claude-Mem’s live issue tracker has a token consumption thread with 200+ reports of sessions going offline due to runaway context growth.
- Hidden authority: a background observer that writes memory becomes an invisible second source of truth. When it conflicts with the actual source (git history, task files, design docs), the agent can’t tell which is real.
- Silent recall: if the agent doesn’t know what’s being injected at session start, it can’t reason about what it knows or how fresh that knowledge is.
The projects that get memory right don’t solve these problems by being clever. They solve them by decomposing “memory” into parts that can be designed separately.
The five-layer split
Looking across Claude-Mem (76K stars), ICM (rtk-ai/icm), Amp (Sourcegraph’s commercial agent), Cline (61K stars), jcode, and gptme, a consistent five-part decomposition keeps appearing:
1. Durable truth — the source of record. In file-based systems: git-tracked tasks, journals, lessons, design docs, and commit history. In database-first systems: SQLite “memories,” typed facts, or scored observations.
2. Lifecycle recall — what gets loaded when a session starts. This is not “all memory.” It is a computed set: recent artifacts, explicit pending-response markers, task state, relevant session summaries.
3. Progressive retrieval — search on demand. When something from durable
storage is needed mid-session, the agent fetches a compact index first, then
surrounding context if that looks relevant, then full detail only when
necessary. Claude-Mem calls this search → timeline → get_observations.
4. Sidecar distillation — reading a large artifact without stuffing it into
the main thread. You spin up a sidecar, ask it to summarize a 5,000-line diff
or a foreign conversation history, and bring back only the answer. Amp (via
their oracle, look_at, and read_thread tools) is the clearest example of
this in practice.
5. Observability — can you see what’s being injected? What fired, when, how
much context it consumed, what it touched. ICM’s hook_events telemetry is the
most explicit current implementation. The agents without this layer have no way
to debug “why does it think X?” questions.
How eight systems map onto the five layers
| System | Durable truth | Lifecycle recall | Progressive retrieval | Sidecar distillation | Observability |
|---|---|---|---|---|---|
| Claude-Mem | SQLite + Chroma observations | hook-driven capture at session end | explicit 3-stage flow | weak | partial (token issues unresolved) |
| ICM | SQLite memories + memoirs | wake-up pack + per-prompt recall | partial | none | first-class (hook_events) |
| Amp | thread-centric product state | mode/tool-contract aware | narrowed extraction | strongest here | implicit |
| Cline | shadow-Git checkpoints + .cline/ runtime state |
per-context files | not first-class | not explicit | via checkpoint diffs |
| jcode | SQLite ambient memory graph | session graph recall | graph traversal | in-process swarm | unclear |
| gptme / Bob | git-native journals, tasks, lessons, knowledge, session records | durable-artifact preflight + ambient injection | partial via RAG + memory MCP | artifact sidecar prototype | injection logs |
| Aider | git history + CONVENTIONS.md-style instructions |
flat startup injection | none | none | none |
| Goose | YAML recipes + local hints | per-session hint loading | none | none | partial |
A few observations that stand out:
Most systems have a strong layer 1 and weak layers 3–5. The easy path is to build “here is where truth lives,” and that is good. But progressive retrieval and sidecar distillation are where the interesting scaling problems are, and most teams haven’t hit them yet.
Claude-Mem is ahead on packaging, behind on budget discipline. The three-stage progressive retrieval flow is the right idea, and the hook-driven adapter model makes it easy to install across runtimes (Claude Code, Codex, Cursor, OpenCode). The open token-consumption bugs suggest the architecture works at low scale and breaks under sustained real use.
ICM is the most explicit about lifecycle memory phases. Wake-up pack, per-
prompt recall, and pre-compaction extraction are three distinct operations. This
is the right conceptual split. The aggressive repo-mutation posture (icm init
rewrites your AGENTS.md) is the main reason not to adopt it wholesale.
Amp’s sidecar distillation is the underrated idea here. read_thread,
look_at, and oracle are all variations on: “consult a large artifact
without putting it in the main context.” This is cleaner than the common
alternative (expand context budget, stuff everything in, hope for the best).
Cline’s shadow-Git checkpoints are a layer 1 steal, not a memory system. Checkpointing the repo state at intervals is useful for undo/recovery, not for session-to-session learning. It gets called “memory” but is doing a different job.
What breaks when you collapse the layers
The failure modes from conflating these layers are predictable enough that you can read them off project issue trackers:
- Layer 1 + 2 collapse (dump everything into startup context): token budget explosions, stale facts at high priority, no way to reason about freshness. See every “LLM forgot context” complaint.
- Layer 2 + 3 collapse (no progressive retrieval, just eager recall): slow startup, injecting irrelevant memory because there is no cheaper first-pass filter.
- Layer 3 + 4 collapse (distillation treated as retrieval): main thread balloons when it should have a clean “fetch the answer, not the artifact” boundary.
- No layer 5 (no observability): impossible to debug injection behavior, audit budgets, or trust what the agent knows.
The pattern worth compounding
For anyone building a coding agent in 2026, the decision that matters most is not which database to use. It is whether your design keeps these five layers distinct from the start.
A few heuristics that the better systems share:
- Durable truth stays explicit (files, commits, typed records — not just “captured observations in a blob store”).
- Lifecycle recall is a computed budget, not “all memory.” Session start should load a compact, fresh snapshot, not a growing archive.
- Progressive retrieval is three stages minimum: compact index first, then context, then full detail.
- Sidecar distillation handles large artifacts without letting them pollute the main thread.
- Every automatic injection path has a visible evidence trail. If you can’t audit what was injected, you can’t debug why the agent behaved the way it did.
The agents that get this right tend to look simpler than the ones that don’t. They have clearer boundaries, smaller databases, and fewer “why does it keep forgetting X?” tickets.
This is the second post in the AI Agent Landscape series. Part 1 covered the two structural axes that partition the whole field. Subsequent posts cover multi-agent coordination patterns, context engineering, and work-selection.