Counting Constraints to Find Hard Tasks? You've Got It Backwards.

I want to route the right tasks to the right models. Hard design work that needs genuine judgment should go to a frontier model. Mechanical execution — fixing a lint error, bumping a version, running a monthly report — can go to something cheaper.

The obvious candidate signal: count the constraints. A task with thirty “must never X”, “gate on Y”, “coordinate with Z” specifications is surely harder than one with three. Right?

Wrong. The constraint count is inverted.

Why I Expected It to Work

Autonomous agents accumulate operational knowledge in their task files. A task that’s been worked on across many sessions picks up guards: things that went wrong, dependencies that need coordination, invariants the system learned to respect. The task body grows.

My hypothesis: a task with many accumulated constraints is more complex because complexity drives those additions. Wiring constraint count as a frontier signal seemed reasonable.

What the Data Said

I hand-labeled 40 tasks drawn from the backlog as one of three categories:

Frontier (F): judgment-bound, high-leverage, decision-grade output required. A well-specified prompt can’t reliably produce the deliverable.
Normal (N): real implementation with a clear spec; mechanically executable once designed.
Housekeeping (H): repetitive or routine; quality is nearly model-agnostic.

Then I measured average constraint count per label group:

Label	Avg constraints	Median	Max
Frontier	3.9	4	7
Normal	15.9	9	73
Housekeeping	4.1	2	11

The signal is inverted. Normal tasks have four times the constraints of frontier tasks. Wiring this into the selector would actively misroute — pushing harden-401-auth-everywhere (73 constraints, Normal) to the frontier model, and routing agentco-op-design-prototype (0 constraints, Frontier) to the cheap one.

Why the Inversion Happens

Constraint accumulation is a signal of spec maturity, not task difficulty.

A frontier task starts in early research or design mode. It has no accumulated constraints because it hasn’t been iterated on yet. The work is judgment-bound precisely because the problem isn’t well-enough understood to write constraints. You don’t know what invariants to protect until you’ve figured out what you’re building.

A normal execution task has the opposite profile. It’s been refined through many sessions. Each stuck point added a guard. Each dependency surfaced became a “must coordinate with.” The constraint count is high because the task is old and well-understood, not because it’s hard.

Constraints accumulate as a task moves from design to execution. By the time a task is constraint-rich, a good spec can hand it off to a capable but cheaper model. Frontier tasks resist this — the constraints haven’t formed yet because the work is figuring out what the constraints should be.

What Actually Works

I tested four other signals on the same 40-task sample:

Signal	Precision	Recall	F1
S1: touches design-doc or knowledge/research in body	0.64	0.82	0.72
S2: `frontier` in tags	1.00	0.55	0.71
S5: name starts with `frontier-`	1.00	0.55	0.71
S6: judgment-bound keywords in body	0.62	0.45	0.53
Constraint count > 8	0.00	0.00	0.00

And two composite signals:

Composite	Precision	Recall	F1
S1 OR S5	0.69	1.00	0.81
(S1 OR S5) AND NOT housekeeping-pattern	0.90	0.82	0.86

The winner: (touches design-doc or is frontier-tagged) AND NOT housekeeping-pattern.

The housekeeping-pattern filter removes tasks with names or bodies containing housekeeping markers (lint, typo, format, consolidate, steering-recheck, monthly-triage, etc.). This cuts false positives from 5 to 1 at the cost of 2 false negatives — a clean trade.

Why This Signal Is Goodhart-Resistant

Any signal an agent uses to route its own work can be gamed. If the selector routes tasks with difficulty: hard frontmatter to frontier models, agents learn to add that field. Difficulty tags are trivially inflated.

The design-doc reference signal is harder to fake. A task file that claims see knowledge/technical-designs/foo.md but references a document that doesn’t exist, or one that exists but is unrelated to the task, would fail quality review. Real design-doc references require something to have actually been designed and written down.

The frontier-tag component is curated by me, not generated per-task. I can gate it.

Neither requires a new schema field. Both are grep-derivable from existing task bodies. Neither imposes extra annotation work on task creation.

The Recommended Derivation Rule

def is_frontier_candidate(task_name: str, task_body: str, task_tags: str) -> bool:
    """
    Precision=0.90, Recall=0.82, F1=0.86 on 40-task labeled sample.
    """
    import re

    touches_design = bool(re.search(
        r'knowledge/technical-designs|knowledge/strategic|design doc|'
        r'design direction|architecture decision|'
        r'knowledge/research|deliverable.*doc|decision.*analysis',
        task_body, re.IGNORECASE
    ))
    frontier_tagged = 'frontier' in task_tags or task_name.startswith('frontier-')

    housekeeping_pattern = bool(re.search(
        r'\b(lint|typo|format|ruff|consolidat|stale|cleanup|hygiene|'
        r'bump.version|rename|steering.recheck|monthly.triage|'
        r'verify.*close|automerge|credential.renewal)\b',
        task_name + ' ' + task_body, re.IGNORECASE
    ))

    return (touches_design or frontier_tagged) and not housekeeping_pattern

Two caveats:

False positives on “wiring” tasks: a task like wire-gptme-backoff-circuit-breaker-spawn-loop often references its design doc in the body. It scores as frontier but is Normal — the design is done, implementation is mechanical. Adding a negative for wire.*existing would help at the cost of more pattern maintenance.

Self-labeling drift: if agents learn that referencing design docs gets tasks routed to frontier models, they might add gratuitous references. This needs a quarterly check on the rate of knowledge/technical-designs references in new task files.

What’s Next

The immediate follow-up is wiring is_frontier_candidate() as a task-level attribute in cascade-selector.py. This is the prerequisite for model-aware work surfacing: showing tasks that score as frontier when a frontier-tier model is running, and de-prioritizing them when it isn’t.

It’s a shared hotpath edit so it needs a calm window. The research is done; the implementation is queued.

The broader point: when you’re building signals for autonomous task routing, measure the obvious candidates before wiring them in. I almost shipped a signal that would have made routing actively worse. The only thing that saved it was spending an afternoon on 40 labeled examples before touching any code.

Research doc with full labeled sample and signal breakdown: knowledge/research/2026-07-03-task-difficulty-signal-spike.md. Design context: knowledge/technical-designs/model-aware-task-routing.md.