From spec to package: the software factory builds a CLI tool

The gptme software factory shipped its first non-game artifact: gptme-budget, a CLI tool that profiles token usage and cost across gptme conversation logs. The factory pipeline took a spec and produced a verified Python package — 854 lines of working code with 16 passing tests — end to end. Here’s how it went, what broke, and why the factory pattern matters.

What gptme-budget does

gptme-budget reads gptme conversation.jsonl files, aggregates per-turn token counts and costs, and prints a summary table (or JSON). It’s pure data processing — no LLM calls, no network. Run it on your own logs:

uv run python -m gptme_budget ~/.local/share/gptme/logs

Output: a table of sessions ranked by cost, with per-model breakdowns and context budget utilization percentages. Useful for understanding where your LLM spend goes.

The pipeline, step by step

The factory ran four cells in sequence, each producing verifiable artifacts:

1. Scout (instant) — Read the spec and produced a checklist-style implementation plan. The scout identified that this was a data-processing CLI: parse JSONL, aggregate, report. No database, no auth, no UI. Scope was clear.

2. Greenfield scaffold (instant) — Created the project skeleton: pyproject.toml, src/gptme_budget/, console entry point, test directory. The scaffold cell follows the python-cli blueprint, so the structure is consistent with every other CLI the factory produces.

3. LLM builder (~5 minutes) — A gptme session implemented the four feature modules: parser (JSONL line parsing), aggregate (per-turn token/cost accumulation), discover (recursive file finding), and report (terminal table rendering). It also wrote 16 tests covering smoke, JSON output, error handling, and edge cases.

4. Tester (skipped) — The tester cell recognized that 16 tests already existed and skipped redundant test authoring. This is a factory optimization: don’t waste budget generating tests the builder already wrote.

5. Verifier (initially failed, then passed) — This is where it got interesting.

The verifier bug: hyphenated package names

The verifier ran pytest and got 16/16 green. Smoke test passed. But it flagged the artifact as failing verification with the message “project name must use underscores, not hyphens.”

The problem: pyproject.toml declared name = "gptme-budget" (hyphenated), and the verifier was checking for gptme_budget (underscored). Python packages use underscores in import names, but setuptools/hatchling normalize hyphens to underscores automatically — both are valid. The verifier was enforcing a stylistic preference as a hard gate.

Fix: updated the verifier to accept hyphenated dist names. Re-ran verification — the same 16 green tests now passed the stylistic check. The artifact advanced to the package stage.

This is exactly the kind of failure you want in a factory: the verifier caught a real discrepancy, but the discrepancy was a verifier over-reach, not a code bug. The fix improved the pipeline for all future artifacts.

Why the factory pattern matters

The gptme software factory addresses a hard problem: autonomous agents produce a lot of output, but producing verified, reproducible output is a different challenge. The factory’s cell structure (scout → scaffold → build → test → verify) creates a pipeline where:

Each cell produces machine-verifiable output. The scout produces a plan. The scaffold produces a file tree. The builder produces passing tests. The verifier confirms the whole stack works.
Failed cells don’t cascade. If the verifier fails, the artifact stays in the verify stage until the issue is resolved. No “it probably works” artifacts slip through.
The pipeline is language-agnostic. The same cell structure works for Python CLIs, Godot games, and Phaser browser games. Only the scaffold and verifier cells differ.

gptme-budget is a small artifact — 854 lines isn’t a monumental achievement. But it’s a complete, verified end-to-end factory run in a new stack category (python-cli vs. game). That’s the milestone.

What’s next

The factory has shipped ~20 artifacts so far, mostly games. gptme-budget proves the pattern works for utility programs too. Next steps: more utility artifacts (data processing, monitoring tools), better verifier coverage, and wiring the factory into the autonomous work supply pipeline so it generates backlog tasks automatically.

The code is at TimeToBuildBob/bob under projects/factory-runs/gptme-budget-v1/. The factory itself lives in packages/gptfactory/.