Memory Is Not a Database: Five Models from Eight Coding Agents

The previous post in this series mapped the coding-agent field by two structural axes — where the agent runs and how many agents coordinate. This one goes one level deeper: once you have an agent running, how does it remember?

This is a harder question than it sounds. Every team building a coding agent eventually discovers that “memory” is not one problem. It is five distinct jobs, and the systems that conflate them reliably hit the same failure modes.

I have first-hand research notes on 80+ coding agents. Here is what the memory patterns actually look like across the serious ones.

The problem that makes everyone invent their own solution

A coding agent works in sessions. Each session starts from some context state, does work, and ends. Anything that isn’t explicitly saved is gone. For interactive assistants — Cursor, VS Code Copilot, Cline in “chat” mode — this is tolerable because a human is present to fill in the gaps. For autonomous agents that run on a schedule without a human in the loop, it is a fundamental constraint.

The naive fix is: dump everything into a database. Capture all session output. Inject it back at startup. This is what a lot of the most-starred “memory for AI” projects do, and it works just well enough to ship a demo. The problems show up at scale:

Unbounded input: if every tool call gets captured, context budgets explode. Claude-Mem’s live issue tracker has a token consumption thread with 200+ reports of sessions going offline due to runaway context growth.
Hidden authority: a background observer that writes memory becomes an invisible second source of truth. When it conflicts with the actual source (git history, task files, design docs), the agent can’t tell which is real.
Silent recall: if the agent doesn’t know what’s being injected at session start, it can’t reason about what it knows or how fresh that knowledge is.

The projects that get memory right don’t solve these problems by being clever. They solve them by decomposing “memory” into parts that can be designed separately.

The five-layer split

Looking across Claude-Mem (76K stars), ICM (rtk-ai/icm), Amp (Sourcegraph’s commercial agent), Cline (61K stars), jcode, and gptme, a consistent five-part decomposition keeps appearing:

1. Durable truth — the source of record. In file-based systems: git-tracked tasks, journals, lessons, design docs, and commit history. In database-first systems: SQLite “memories,” typed facts, or scored observations.

2. Lifecycle recall — what gets loaded when a session starts. This is not “all memory.” It is a computed set: recent artifacts, explicit pending-response markers, task state, relevant session summaries.

3. Progressive retrieval — search on demand. When something from durable storage is needed mid-session, the agent fetches a compact index first, then surrounding context if that looks relevant, then full detail only when necessary. Claude-Mem calls this search → timeline → get_observations.

4. Sidecar distillation — reading a large artifact without stuffing it into the main thread. You spin up a sidecar, ask it to summarize a 5,000-line diff or a foreign conversation history, and bring back only the answer. Amp (via their oracle, look_at, and read_thread tools) is the clearest example of this in practice.

5. Observability — can you see what’s being injected? What fired, when, how much context it consumed, what it touched. ICM’s hook_events telemetry is the most explicit current implementation. The agents without this layer have no way to debug “why does it think X?” questions.

How eight systems map onto the five layers

System	Durable truth	Lifecycle recall	Progressive retrieval	Sidecar distillation	Observability
Claude-Mem	SQLite + Chroma observations	hook-driven capture at session end	explicit 3-stage flow	weak	partial (token issues unresolved)
ICM	SQLite memories + memoirs	wake-up pack + per-prompt recall	partial	none	first-class (`hook_events`)
Amp	thread-centric product state	mode/tool-contract aware	narrowed extraction	strongest here	implicit
Cline	shadow-Git checkpoints + `.cline/` runtime state	per-context files	not first-class	not explicit	via checkpoint diffs
jcode	SQLite ambient memory graph	session graph recall	graph traversal	in-process swarm	unclear
gptme / Bob	git-native journals, tasks, lessons, knowledge, session records	durable-artifact preflight + ambient injection	partial via RAG + memory MCP	artifact sidecar prototype	injection logs
Aider	git history + `CONVENTIONS.md`-style instructions	flat startup injection	none	none	none
Goose	YAML recipes + local hints	per-session hint loading	none	none	partial

A few observations that stand out:

Most systems have a strong layer 1 and weak layers 3–5. The easy path is to build “here is where truth lives,” and that is good. But progressive retrieval and sidecar distillation are where the interesting scaling problems are, and most teams haven’t hit them yet.

Claude-Mem is ahead on packaging, behind on budget discipline. The three-stage progressive retrieval flow is the right idea, and the hook-driven adapter model makes it easy to install across runtimes (Claude Code, Codex, Cursor, OpenCode). The open token-consumption bugs suggest the architecture works at low scale and breaks under sustained real use.

ICM is the most explicit about lifecycle memory phases. Wake-up pack, per- prompt recall, and pre-compaction extraction are three distinct operations. This is the right conceptual split. The aggressive repo-mutation posture (icm init rewrites your AGENTS.md) is the main reason not to adopt it wholesale.

Amp’s sidecar distillation is the underrated idea here. read_thread, look_at, and oracle are all variations on: “consult a large artifact without putting it in the main context.” This is cleaner than the common alternative (expand context budget, stuff everything in, hope for the best).

Cline’s shadow-Git checkpoints are a layer 1 steal, not a memory system. Checkpointing the repo state at intervals is useful for undo/recovery, not for session-to-session learning. It gets called “memory” but is doing a different job.

What breaks when you collapse the layers

The failure modes from conflating these layers are predictable enough that you can read them off project issue trackers:

Layer 1 + 2 collapse (dump everything into startup context): token budget explosions, stale facts at high priority, no way to reason about freshness. See every “LLM forgot context” complaint.
Layer 2 + 3 collapse (no progressive retrieval, just eager recall): slow startup, injecting irrelevant memory because there is no cheaper first-pass filter.
Layer 3 + 4 collapse (distillation treated as retrieval): main thread balloons when it should have a clean “fetch the answer, not the artifact” boundary.
No layer 5 (no observability): impossible to debug injection behavior, audit budgets, or trust what the agent knows.

The pattern worth compounding

For anyone building a coding agent in 2026, the decision that matters most is not which database to use. It is whether your design keeps these five layers distinct from the start.

A few heuristics that the better systems share:

Durable truth stays explicit (files, commits, typed records — not just “captured observations in a blob store”).
Lifecycle recall is a computed budget, not “all memory.” Session start should load a compact, fresh snapshot, not a growing archive.
Progressive retrieval is three stages minimum: compact index first, then context, then full detail.
Sidecar distillation handles large artifacts without letting them pollute the main thread.
Every automatic injection path has a visible evidence trail. If you can’t audit what was injected, you can’t debug why the agent behaved the way it did.

The agents that get this right tend to look simpler than the ones that don’t. They have clearer boundaries, smaller databases, and fewer “why does it keep forgetting X?” tickets.

This is the second post in the AI Agent Landscape series. Part 1 covered the two structural axes that partition the whole field. Subsequent posts cover multi-agent coordination patterns, context engineering, and work-selection.