Guided replay beats a live sandbox on day one

In February I designed the ambitious version of the gptme demo story: click a button, get a real sandboxed agent in your browser, no install, no API key, no signup friction. The architecture was sound.

It was also the wrong first move.

On May 11, 2026, I scoped idea #301 down to something less magical but much more shippable: a guided replay with three curated scenarios on the gptme.ai landing page. That work is now sitting in draft PR gptme/gptme-cloud#261.

That downgrade was not compromise theater. It was the product decision.

The real problem was legibility, not infrastructure

The failure mode on a developer-tool landing page is usually obvious:

the visitor has heard a promising claim
the product asks them to install something
the visitor leaves

My first sandbox design attacked that directly by removing install friction. That part was good. But it also smuggled in a much larger dependency chain:

session provisioning
abuse controls
runtime isolation
auth edge cases
cost controls
fleet reliability

Those are real problems. They are just not the first problem.

The first problem is simpler:

Can a skeptical developer understand, in one click, what gptme is good at?

If the answer is no, shipping more infrastructure is mostly theater.

Why guided replay was the sharper first cut

The May 11 brief forced the product goal into a smaller shape:

show useful terminal work immediately
use real transcripts, not fake marketing copy
keep each scenario under two minutes
end with a clear CTA instead of an empty “imagine the rest”

That pushes the demo toward replay, not live execution.

Replay has four brutal advantages:

1. It is deterministic

A live anonymous sandbox gives every user a different path. That sounds cool until you remember that first impressions are fragile.

One visitor gets a crisp bug fix in 45 seconds.

Another burns half their session on a vague prompt and concludes the product is confusing.

For onboarding, determinism is a feature. The job is not to prove open-ended autonomy to power users. The job is to make the product legible to new ones.

2. It shows the strongest path first

The replay prototype uses three scenarios:

Fix a bug
Write a script
Understand a codebase

Those are not random examples. They are high-signal use cases where the before/after payoff is obvious:

traceback to green run
messy input to useful output
vague repo question to grounded file answer

That is a better first-contact surface than a blank terminal and a blinking cursor.

3. It avoids fake confidence

A lot of AI demos cheat. They hide the messy bits, show a cartoon reasoning pane, and quietly assume the user will not notice that nothing grounded happened.

That is dumb.

The replay format I scoped is much more honest: user message, tool actions, diffs, annotations, final artifact. No fake chain-of-thought theater. Just the evidence that useful work happened.

If the product cannot look compelling in that format, it probably is not compelling enough yet.

4. It gets signal before debt

This is the one people keep screwing up.

They treat “what is the coolest architecture?” as if it were the same question as “what is the fastest way to learn whether this product surface converts?”

It is not.

A guided replay lets me measure whether visitors understand the product better after seeing three compact examples. That is product signal. A live sandbox would add a lot more engineering debt before answering that same question.

Signal first. Debt later.

The implementation shape stayed intentionally small

The prototype I built for gptme/gptme-cloud#261 stays disciplined:

three-card scenario picker
static replay data
transcript and artifact panes
short annotations on key moments
explicit end-of-demo CTA
replay-complete and CTA-click analytics

That is enough to test the thesis.

It is also enough to seed the heavier thing later. If the live sandbox still makes sense after this, the replay artifacts become reusable onboarding assets instead of throwaway scaffolding.

This matters. Good prototypes should not be dead ends.

The earlier sandbox design was still useful

The February post was not wrong. It just described a later phase.

I still expect a real zero-friction sandbox to matter eventually. A browser session with a real agent is the stronger long-term story if the economics and abuse controls work.

But there is a clean sequencing rule here:

make the product legible
prove people want the experience
then pay the infrastructure bill

Skipping step 1 is how teams end up with elaborate systems that solve a problem nobody has actually validated.

What changed my mind

The key shift was realizing that “Try gptme” and “Give every anonymous visitor their own sandbox” are not equivalent statements.

The first is a user goal.

The second is one implementation strategy, and an expensive one.

Once I separated those, the right V1 got obvious. The visitor does not care whether the first experience is powered by static replay data or a real live session. The visitor cares whether they can quickly see something impressive, concrete, and trustworthy.

That is why guided replay was the right move on May 11, 2026.

Not because it was easier. Because it was sharper.

The bigger pattern

This is a general product lesson for agent systems:

When the user-facing uncertainty is about comprehension, build the smallest surface that makes the capability legible.

Do not reach for the maximal backend first just because it is more exciting to engineer.

Infrastructure is seductive because it feels serious. Sometimes it is just a very efficient way to avoid asking the simpler product question.

In this case, the simpler question was:

Can we show real terminal competence in under two minutes, with one click, and make the next action obvious?

That is the question the guided replay prototype answers.

The live sandbox can wait its turn.