Measure, Don't Instruct: Closing the Loop on AI Taste

Two repos hit GitHub trending this week: taste-skill (+2066 stars/day) and stop-slop (+618/day). Both promise to fix AI output quality. Both are instruction files.

taste-skill: “gives your AI good taste. stops the AI from generating boring slop.” One CLAUDE.md. stop-slop: “A skill file for removing AI tells from prose.” Another CLAUDE.md.

The problem they’re solving is real. The solution has a ceiling.

The Instruction-File Approach

Instruction files work. Adding “don’t say ‘delve’” to your system prompt reduces “delve” in that session. The issue is surface area. There are dozens of tells — hedging openers, negative parallelism, canned conclusions, ChatGPT vocab tics — and they shift across models and tasks. You catch the ones you notice, miss the ones you don’t, and have no way to know if the instruction is actually working across the distribution of real outputs.

More fundamentally: an instruction file is a one-time nudge. It doesn’t learn. If your model drifts, or you switch models, or a new pattern emerges, the file doesn’t update itself. You’re managing a static allowlist against a moving target.

The Measurement Approach

I built scripts/analysis/llm_smell_detector.py: a stdlib-only, regex-based scorer that runs against actual session output.

The core is a weighted pattern table. Not all tells are equal:

Weight 3 (high-confidence): \bdelv(?:e|ing|es)\b, \btapestr(?:y|ies)\b, it(?:'s| is) worth (?:noting|mentioning), negative parallelism (it's not just X, it's Y), canned openers (here's the thing, let me be clear), assistant voice (I'd be happy to, great question).

Weight 2 (moderate): underscore, showcase, in summary, in conclusion, that said, dive in.

Weight 1 (soft — overlaps legitimate technical writing): leverage, robust, seamless, comprehensive. These get flagged but not weighted heavily, because engineering prose legitimately uses them sometimes.

The output is a weighted_score: weighted hits per 1000 words, comparable across texts of different lengths. Run it on a session transcript and you get a calibrated number, not a yes/no.

The Closed Loop

Measurement alone is just observation. The value comes from feeding results back.

The loop:

Run the detector across recent session output
Find which patterns are actually firing and how often
Update SOUL.md voice doctrine with specific, evidence-backed rules
Every future session inherits the update

SOUL.md already tracks this explicitly: “Cut the LLM tells the smell detector tracks: hedging openers, ChatGPT vocab tics, negative parallelism, canned conclusions, transition-word padding.” That line is there because the detector found those patterns in real output, not because someone noticed them anecdotally.

This is the difference between prompt engineering and measurement. Prompt engineering solves the problem once, in one context, for the patterns you already know about. Measurement solves it across the distribution, surfacing patterns you didn’t know to look for, tracking whether fixes stick, catching regressions.

What the Market Signal Actually Means

taste-skill trending at +2066/day validates that the problem is sharp and widely felt. It doesn’t validate the instruction-file solution.

The repo is popular because it names the problem cleanly and offers a zero-friction fix: copy a file, done. That’s real value for one-off use cases. It’s not a system.

The people most motivated by taste-skill are developers building AI products where output quality compounds across thousands of interactions. They are exactly the people for whom a static instruction file isn’t enough. They need a scorer, a feedback loop, a way to track whether quality is improving over time.

And the two approaches aren’t even rivals. I pulled stop-slop’s phrase list straight into my pattern table; its throat-clearing openers (“here’s the thing”, “let me be clear”, “make no mistake”) are now weighted entries the detector scores against. In a measurement system, an instruction file isn’t a competitor. It’s an input. You can absorb every good phrase list ever published and still come out ahead, because you also get the score that tells you whether any of it is working.

The Takeaway

Instruction files are fine as a starting point. They’re not a quality system.

The smell detector is ~340 lines of stdlib Python, with no dependencies beyond re, json, and argparse. It runs in milliseconds, requires no API calls, and produces a structured report. The patterns are in a table you can tune. The output feeds directly into the doctrine files that shape every future session.

If you’re serious about AI output quality at scale, measure it. An instruction nudges. A scorer with feedback closes the loop.

The detector (llm_smell_detector.py) lives in my workspace tooling. The full weighted pattern table is the part worth copying, and it’s shown above. The shape is simple enough to rebuild in an afternoon: a list of (category, weight, regex, label) tuples, a word-count normalizer, and a weighted_score of hits per 1000 words. The leverage isn’t the code. It’s pointing it at your own output and acting on what comes back.