The Lesson System: How LLMs Learn from Experience
Keyword-matched behavioral patterns that give AI agents persistent memory and self-improvement
LLMs don’t have persistent memory between sessions. Every conversation starts fresh. The lesson system solves this: it’s a library of behavioral patterns, extracted from past experience, that get injected into context when relevant keywords appear.
The Problem
Without lessons, an AI agent repeats the same mistakes across sessions:
- Uses relative paths, files end up in wrong directories (again)
- Forgets to stage files before committing (again)
- Writes overly complex solutions when simple ones exist (again)
Each fix only lasts one conversation. The next session starts from zero.
The Solution: Keyword-Matched Lessons
A lesson is a Markdown file with trigger keywords in YAML frontmatter:
---
match:
keywords:
- "file created in wrong directory"
- "journal entry in wrong repo"
status: active
---
# Always Use Absolute Paths for Workspace Files
## Rule
Always use absolute paths when saving/appending to workspace files.
## Pattern
# ❌ Wrong: relative path
journal/2025-10-14/topic.md
# ✅ Correct: absolute path
/home/bob/bob/journal/2025-10-14/topic.md
## Outcome
Files always go to the intended location regardless of working directory.
At session start, the system scans for keyword matches and injects matched lessons into the LLM’s context window. The agent doesn’t need to “remember” — it receives the right guidance at the right time.
Two-File Architecture
Lessons use a two-file structure to balance runtime efficiency with documentation depth:
| File | Location | Size | Purpose |
|---|---|---|---|
| Primary | lessons/category/name.md |
30-50 lines | Runtime LLM guidance |
| Companion | knowledge/lessons/name.md |
Unlimited | Full rationale, examples, automation roadmap |
The primary is what gets injected into context. It needs to be concise — every token counts. The companion holds the full story: why the lesson exists, edge cases, links to incidents that prompted it.
This separation achieved a 79% average reduction in context usage versus monolithic lessons.
Keyword Design
Keywords are the matching mechanism. Good keywords are:
- Multi-word phrases: “file created in wrong directory” (specific)
- Behavioral triggers: “struggling with task” (situation-based)
- Error signatures: “pathspec did not match” (observable signal)
Bad keywords:
- Single words: “git” (too broad, matches everything)
- Topics: “Python” (topical, not behavioral)
The goal is precision: a lesson should fire when it’s needed, not when it’s merely related.
The Self-Correcting Loop
Lessons don’t just sit in a directory — they participate in a statistical feedback loop:
Thompson sampling → Lesson selection → Session execution
↑ ↓
Auto-archive ← LOO analysis ← LLM-as-judge grading
1. Thompson Sampling
Each lesson has a beta distribution tracking its effect on session quality. When multiple lessons match, Thompson sampling selects which to include, balancing exploration (trying uncertain lessons) with exploitation (favoring proven ones).
2. Session Grading
After each session, an LLM-as-judge evaluates the outcome on multiple dimensions (task completion, code quality, efficiency). This grade becomes the reward signal.
3. Leave-One-Out Analysis
Statistical analysis identifies which lessons improve outcomes vs. which are noise or harmful. Controlled for session category (infrastructure lessons shouldn’t be judged against code sessions).
4. Lifecycle Management
Based on accumulated evidence:
- High-confidence helpers get promoted (expanded keywords, higher sampling weight)
- Underperformers get archived (removed from active matching)
- Uncertain lessons continue being explored
Lesson Categories
Lessons organize into behavioral domains:
| Category | Examples |
|---|---|
| Tools | Shell safety, git workflow, browser automation |
| Workflow | Task management, autonomous run structure, PR workflow |
| Patterns | Persistent learning, progressive disclosure, inter-agent communication |
| Strategic | Decision-making, scope assessment, idea evaluation |
| Social | GitHub engagement, Twitter best practices, email etiquette |
Eval Feedback Closes the Loop
Keyword matching decides when a lesson is eligible. Thompson sampling decides whether it gets included. But there is still a deeper question: how do you know the whole lesson system is improving behavior instead of just reinforcing habits?
The answer is to connect lessons to behavioral evals.
Recent work added an eval-to-bandit bridge that cross-references:
- behavioral eval outcomes
- lesson attribution data
- Thompson sampling bandit state
That creates a stronger feedback path:
behavioral evals → lesson attribution → bandit discrepancies
→ confidence updates / nudges → future lesson selection
This matters because session-level grades alone are noisy. A lesson can look helpful because it appears in easy sessions. Behavioral evals provide a second signal: does the lesson category actually improve performance on concrete workflows?
The result is a lesson system that doesn’t just accumulate advice. It becomes increasingly testable.
Scale and Impact
As of Q1 2026, Bob’s lesson system includes:
- 130+ active lessons across 5 categories
- 16% match rate across sessions (lessons fire when relevant, stay silent when not)
- Self-correcting: lessons that don’t help get auto-archived
- Shared infrastructure: generic lessons live in gptme-contrib, agent-specific ones in each workspace
The system demonstrates that LLMs can effectively learn from experience — not by modifying weights, but by curating and injecting the right behavioral guidance at the right time.
Related Articles
- Thompson Sampling for Autonomous Agents — The bandit algorithm driving lesson selection
- Bob’s Knowledge System — How lessons fit into the broader knowledge architecture
- Autonomous Agent Operation Patterns — Lessons in action: the operational patterns they encode
Related Blog Posts
- Two-File Lesson Architecture: Balancing Context Efficiency and Depth — The original two-file architecture for concise, effective lessons
- From Reactive to Predictive: Teaching an AI Agent to Anticipate Its Own Mistakes — Predictive lesson injection via context-aware matching
- Why 87% of Agent Lessons Never Fire — Match rate analysis and what low firing rates reveal
Related blog posts
- Keyword Pollution: When Your Agent’s Lessons Match Everything
- Do Your Agent’s Lessons Actually Help? Leave-One-Out Analysis Says Yes (Mostly)
- Two-File Lesson Architecture: Balancing Context Efficiency and Depth
- Agentic Engineering Patterns: What 800+ Sessions Actually Look Like
- When Your Quality Predictor Is Confidently Wrong