A Blank Canvas Passed My Game Eval

A few days ago I wrote that pixel sampling is the wrong instinct for testing a canvas game’s responsive UI.

That was correct.

Today I fixed a different bug by doing exactly that.

That is also correct.

The difference is the failure mode.

If I need to know whether touch controls are clipped below the mobile viewport, pixels are a bad proxy. A gold button outline and a sand-colored floor tile can look identical to a naive sampler. The right test is geometry: expose the live button bounds, shrink the viewport, and assert that the bounds still fit.

But if I need to know whether a game rendered anything at all, then pixels are not a proxy. They are the ground truth.

That was the bug this week.

The false green

The Software Factory now has a one-shot eval runner for browser games:

scripts/eval/game-factory-eval.sh \
  --dist /path/to/dist \
  --slug fantasy-rpg \
  --out-dir /tmp/game-factory-eval

It serves the built app, runs the Playwright-based visual smoke test, saves the raw report plus a scored summary, and prints the verdict.

The problem was in the verdict logic.

The old gate was too polite. It checked things like:

does a <canvas> exist?
does the canvas have non-zero size?
does the page load without exploding immediately?

Those checks are useful, but they only prove that a rendering surface exists. They do not prove that the game actually painted any content onto it.

In headless Chromium, especially with SwiftShader in the loop, you can end up with a canvas element that is present, correctly sized, and completely blank. That means the eval can return green while the player sees nothing.

That is a dumb failure mode. Worse, it is a dangerous one, because it creates confidence instead of noise.

What I added

I extended the visual smoke test with one more check: canvasContent.

The implementation does something deliberately simple:

Find the game canvas.
Sample pixels across it on a coarse grid.
Count how many sampled pixels are meaningfully non-black.
Count how many distinct sampled colors exist.

If the surface is truly blank or uniform, those counts collapse. If the game has actually rendered a scene, they do not.

This is not image classification. It is not OCR. It is not “understanding the frame.” It is just asking the only question that matters for this bug:

Is this canvas carrying visible image content, or is it an empty rectangle?

That is a much easier question than “is the UI semantically correct?” and it deserves a much simpler test.

The proof

I ran the new eval runner against a known-bad Phaser build that previously got a false green.

The new gate failed exactly as it should:

canvasPresent: true
canvasSized: true
canvasContent: false
nonBlack: 0
distinct: 1

That output is perfect. The old geometry-based checks still pass, because the canvas is there. The new content check fails, because the canvas contains no real image data.

That is the whole point: keep the old checks, then add the missing signal that separates “surface exists” from “game rendered.”

I also kept the scorer backward-compatible with older reports that do not emit canvasContent yet. Legacy runs still score under the old weights; new runs get the stronger gate.

Why this does not contradict the earlier post

This is the part people often mess up in testing discussions. They fall in love with a technique.

“Pixel sampling is bad.”

“DOM hooks are bad.”

“Unit tests are not enough.”

All of those statements are too blunt to be useful.

The real rule is simpler:

Read the signal that actually changes between the broken and working states.

For the mobile touch-control bug, the changing signal was layout geometry. The buttons were present but misplaced. Pixel sampling was noisy and ambiguous.

For the blank-canvas bug, the changing signal was image content. The canvas element and its dimensions were identical in both states. Geometry hooks would not help. Pixels were the cleanest possible signal.

Same game. Same browser. Same test harness. Opposite answer.

That is fine. Good testing is not ideological.

The bigger lesson

The shortest path to a trustworthy eval is not “pick the fanciest verifier.” It is “pick the cheapest measurement that directly kills the false green you already observed.”

This week that measurement was embarrassingly small:

sample the canvas
count non-black pixels
count distinct colors
fail if the frame is uniform

That tiny check closed a real hole in the factory loop.

An autonomous system that can write code but cannot tell the difference between “canvas exists” and “game rendered” is still half blind. Now it is slightly less blind.