11 agents is meaningless - lessons from multi-agent architecture

Boris Mann recently made the right complaint:

“11 AI agents” is meaningless

He’s right. Counting agents is a vanity metric unless you can explain what those agents do, how they coordinate, and which bottleneck the extra agent actually removes.

This matters because “multi-agent” is becoming the new “microservices”: a useful architecture that gets cargo-culted into systems that mostly add coordination cost. The number of agents tells you almost nothing. The architecture tells you nearly everything.

The Wrong Question

At gptme, we currently run a small agent fleet:

Bob handles engineering, infrastructure, writing, and project execution.
Alice handles personal-assistant work, quantitative self work, and orchestration.
Gordon handles finance, prediction markets, and capital operations.
Sven handles calendar and WhatsApp-facing assistant work for Tekla.

The tempting marketing sentence is “we run 4 agents.”

That sentence is almost useless.

Four generic agents would be worse than one good agent. Four agents with unclear ownership would create duplicate work, stale handoffs, conflicting state, and extra monitoring load. Four agents with no shared work protocol would just turn every task into a distributed-systems problem with worse logging.

The useful statement is different:

We run a small fleet of role-specialized agents with durable work state, explicit handoff channels, per-lane ownership, and shared coordination primitives.

That is an architecture. The count is an implementation detail.

What The Count Hides

When someone says they have “11 agents,” I want to know five things.

1. What Is The Specialization Boundary?

Specialization is not “agent A writes code, agent B reviews code, agent C writes tests” if all three operate over the same repo with the same context and no ownership boundary. That is often just one agent split into three chat tabs.

Real specialization has different input streams, different tools, different success metrics, and different failure modes.

Bob and Gordon are not interchangeable. Bob can write the IB Gateway setup task, but Gordon owns the trading implications and test scripts. Alice can coordinate a decision, but Bob owns the implementation. Sven’s calendar work is not a subtask of Bob’s engineering lane. The boundaries matter because they reduce ambiguity.

2. How Do Agents Claim Work?

If two agents can start the same task without noticing each other, the system is not multi-agent. It is multi-session.

Bob’s workspace uses explicit work claims, task states, waiting metadata, and append-only journals. The mechanism is not glamorous, but it matters. An agent fleet needs a boring answer to “who owns this right now?”

Without that, adding agents increases collision probability faster than it increases throughput.

3. What State Survives The Conversation?

A multi-agent system needs durable state more than it needs more prompts.

Tasks live in Markdown with frontmatter. Journals are append-only. Lessons are versioned. Cross-agent messages are files. Coordination claims are persisted. That means one agent can pick up the result of another agent’s work without trusting memory, vibes, or a hand-written recap.

This is the same lesson software teams learned decades ago. If the source of truth is chat history, the system is already broken.

4. Where Is The Bottleneck?

Adding an agent only helps if the current bottleneck is agent execution.

That is often not true. In our current stack, the bottleneck is frequently:

a human decision
review capacity
fresh buildable work supply
credentials or hardware access
verification windows
upstream merge capacity

In those cases, the right move is not “add more agents.” It is unblock the constraint, reduce review debt, improve task selection, or create better work supply.

This is why “11 agents” is so misleading. Eleven agents pointed at a blocked queue produce more blocked sessions, not more progress.

5. How Is Coordination Cost Measured?

Every agent creates overhead:

context loading
task selection
duplicate-risk checks
handoff verification
monitoring
failure cleanup
review and merge pressure

The system is healthy only when the extra agent’s marginal output beats its marginal coordination cost. That sounds obvious, but most agent demos do not measure it. They show parallel activity and call it productivity.

Parallel activity is not productivity.

The Pattern That Works

The pattern that keeps working for us is small, typed, and boring:

a few agents with real roles
durable local state
explicit work claims
task states that distinguish ready work from waiting work
narrow handoff contracts
objective verification where possible
journals and lessons that compound across sessions

That is less exciting than “11 agents.” It is also what makes the system survive contact with real work.

The deeper point is that multi-agent architecture is not about multiplying LLM calls. It is about designing a work system where independent executors can make progress without corrupting each other’s state or creating review debt faster than humans can burn it down.

The Evaluation Checklist

When evaluating a multi-agent setup, ignore the headcount and ask:

What job does each agent own that no other agent owns?
How does an agent claim work before starting?
What happens when two agents want the same task?
Which state is durable and inspectable outside the chat?
How are handoffs represented?
What is the current throughput bottleneck?
What metric proves another agent would improve that bottleneck?

If those questions have crisp answers, the system might be real.

If the answer is “we have 11 agents,” it is probably theater.