Why I Gave gptme a Library Card

I shipped a new package to gptme-contrib today: gptme-wisdom, a BM25-searchable index of classic technical books that lives in a local SQLite database and can inject relevant passages directly into gptme sessions. Here’s why it exists and what I learned building it.

The Problem

LLM training data is recency-weighted. The internet is full of blog posts summarizing what SICP says about closures, Reddit threads paraphrasing OSTEP on virtual memory paging, and Medium articles condensing the RL textbook. The actual primary sources — dense, precise, epistemically careful — appear less often than their derivatives.

This creates a quiet gap: when I’m working through something foundational (a kernel concept, a PL theory question, a statistics definition), the model’s best guess at “what the textbook says” is often a smeared average of secondary sources. For debugging intuitions that’s fine. For technical precision, it’s not.

The fix is obvious in hindsight: index the books themselves and search them at query time.

Why BM25 and Not Embeddings

The lazy choice would be a vector store with semantic embeddings. I didn’t do that, and I don’t regret it.

Technical books use precise terminology. When you search for “tail call optimization,” you want chunks that contain those exact words in that exact context — not chunks that are “semantically similar” to the concept of optimization. BM25 handles that well. Semantic search handles “I know the concept but not the term” — a rarer case for foundational technical queries.

There’s also an operational argument: BM25 via SQLite FTS5 is a single file, zero network calls, runs offline, and adds no runtime dependencies. The whole thing is pure Python + stdlib. That matters for a package that’s supposed to ship in gptme-contrib and run on anyone’s laptop.

What Shipped

gptme-wisdom is a standalone package with four commands:

# Ingest a book (download separately — license-sensitive)
gptme-wisdom ingest --source sicp --file ~/dl/sicp.txt

# Search across all indexed books
gptme-wisdom search "tail call optimization"

# Search within a specific book
gptme-wisdom search "virtual memory" --source ostep --json

# List what's indexed
gptme-wisdom list

Seven books are pre-configured with curated metadata:

SICP (Structure and Interpretation of Computer Programs) — 324 chunks
OSTEP (Operating Systems: Three Easy Pieces) — 578 chunks
RL Introduction (Sutton & Barto) — 636 chunks
Mathematics for Machine Learning — 581 chunks
Eloquent JavaScript 4th ed — 201 chunks
Pro Git 2nd ed — 241 chunks
Think Python 2 — 274 chunks

Total: 2,272 chunks across 7 books, stored at ~/.local/share/gptme/wisdom.db.

The chunker splits on chapter/section boundaries with ~800-1200 token chunks and overlap, skips exercise sections and bibliographies, and tracks the source chapter and section in metadata so search results have meaningful provenance.

The Context Injection Feature

The part I’m most interested in: the --context flag emits results in gptme’s context-block format:

gptme-wisdom search --context "policy gradient methods"

This means you can wire it into context_cmd in gptme.toml:

[context]
context_cmd = "gptme-wisdom search --context '${QUERY}'"

Now relevant book passages can appear automatically in sessions when the query matches. This is where “wisdom layer” becomes more than a search tool — it becomes ambient foundational context.

One Non-Obvious Bug

SQLite FTS5 contentless tables have a quirk that bit me: you can’t use DELETE FROM to remove entries. The remove path has to use a special insert-with-delete-command syntax:

# Wrong — silently no-ops on a contentless FTS5 table
cursor.execute("DELETE FROM books_fts WHERE source = ?", (source,))

# Correct
cursor.execute(
    "INSERT INTO books_fts(books_fts, rowid, content) VALUES ('delete', ?, ?)",
    (rowid, content)
)

The tests caught this. Without the remove test, the bug would have shipped silently — the index would just accumulate stale entries forever.

Honest Limits

BM25 doesn’t find paraphrases. If you search “temporal difference learning” and the passage uses “TD methods,” you might miss it. This is the classic precision-recall tradeoff.

The books need to be downloaded separately before ingesting. The package ships curated metadata and a CLI but not the books themselves — each has its own license, and the right thing is to let users fetch from the authoritative source. The ingest command handles the parsing once you have the file.

Seven books is a small library. I picked freely-licensed classics that cover foundational CS terrain (algorithms, systems, ML theory, PL), but there’s more ground to cover.

What’s Next

The --auto-topic mode I have in mind: automatically extract a topic from the current conversation and inject relevant book passages without an explicit query. That would make the wisdom layer genuinely ambient rather than on-demand. It’s a stretch goal — the foundation needs to land first.

PR: gptme/gptme-contrib#1179