Retrospectives — The Pipeline, A Record Before The Pivot

Date 2026-05-12 · Version v3.7.7 → v4.0 · Games 47 shipped

We built a multi-agent game studio with eight specialists, sixteen phases, and a self-improvement loop. It shipped forty-seven games. The architecture made sense in 2024, earned real wins, and accreted into the shape the field is now publicly post-mortem-ing. This is the record of what we built, why it was worth building, and why we are tearing most of it down.

01 What we built

Game Studio is an autonomous pipeline that turns a paragraph of game concept into a playable single-file browser game. A user types “/create-game tetris but with weather”, eight specialized agents pass files through a sixteen-phase pipeline, and somewhere between twenty and ninety minutes later a self-contained game.html drops into games/<name>/ with QA reports, design docs, playtester verdicts, a CEO diary entry, and a deviation log stapled to it. The agents never share memory. They communicate exclusively through files on disk. Each one stays in its lane: the developer never writes design docs, the designer never writes code, the QA tester never fixes bugs.

Forty-seven games came out of this between mid-March and mid-May 2026. Pong, Asteroids, Claude Invaders, Pacman, Sea Wolf, Sudoku, Zelda, Tetris, Star Force, Tower Defense, a Celeste-aspirational platformer called Soft Embers, a compass-themed metroidvania called Compass Apprentice. Some are good. All run. The bet that produced them was that specialization, contracts, and structured handoffs would beat an unstructured loop — that “eight roles working in sequence” would outperform “one model in a tight loop until done.”

The bet was rational when we placed it. The bet aged poorly. This piece is the record of both.

02 The architecture

Eight agents, in roster order: the game-director owned the brief, creative exploration, and final triage. The designer wrote game-design.md — mechanics, rules, balance, the State Transition Table, the Identity Feel Contract, the Perceptibility Assertions, the scheduler reset points. The developer wrote game.html from scaffold to ship, starting always from templates/base-game.html. The art-director ran twice — once before code, to author a palette and typography contract in art-direction.md, and once after, to polish the CSS and audio without touching game logic. The qa-tester read the code against the design and filed bug reports, never fixing anything itself. The playtester evaluated fun and produced a SHIP-READY / SHIP-WITH-NOTES / NEEDS-WORK verdict. The player agent drove the game in a headless Chromium via Playwright, observed window.__test state, sent keyboard inputs, and wrote a play report. The ceo did strategic assessments, wrote one diary entry per shipped game, and ran cross-game retrospectives every ten games.

Sixteen phases, as one linear path: VALIDATE → IDEATE → BRIEF → DESIGN → FEASIBILITY → IMPLEMENT → QA → FIX → STYLE → SMOKE TEST → PLAY TEST → QA+PLAYTEST (parallel) → ITERATE → SHIP → CEO DIARY → AGENT FEEDBACK. Each phase had its own gate. By v3.7.7 the gates included a Node syntax validator, a smoke test, a deterministic completability simulator that ran physics in headless Node, a journey-test that asserted state-machine transitions actually advanced gameplay, an art direction contract grep, an SFX wiring audit, a state-transition coverage audit, a spec-subsection coverage audit, an IMMUTABLE preconditions audit, and a deviation log stapled to every submission.

Self-improvement was built in. Each shipped game ended with the CEO writing a diary entry — one lesson, one fix — and proposing an agent-improvement edit to a specific agent file. The orchestrator validated and applied those edits, so the pipeline that built game N+1 was the pipeline that built game N, plus whatever lesson N produced. Every ~10 games, a batch retrospective synthesized cross-game patterns the individual diaries could not surface. The version number went up. v1 became v2.7 became v3.4 became v3.7.7. That number was the studio’s pulse.

03 Why it was fun

Designing the contracts was the craft. Every agent was a persona with influences, principles, and a list of things it hated; every handoff was a typed artifact with a known shape. The State Transition Table format was negotiated. The Identity Feel Contract was an invention — three to five atomic, observable feel bullets attached to the game’s signature mechanic, marked UNCUTTABLE so they could not be sacrificed when the line budget tightened. The art direction contract committed eight palette tokens, two typography tokens, a rendering approach, a particle language, a signature visual move, and a forbidden-anti-patterns list — before the developer wrote a single line.
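To make the "typed artifact with a known shape" idea concrete, here is a minimal sketch of a shape check for the art direction contract. The real artifact was art-direction.md; the in-memory object form, the field names, and the function validateArtContract are all assumptions made for illustration. Only the counts (eight palette tokens, two typography tokens) and the list of sections come from the description above.

```javascript
// Hypothetical in-memory form of an art direction contract. The real artifact
// is art-direction.md; every field name below is an assumption.
const REQUIRED_TEXT_FIELDS = ["rendering", "particleLanguage", "signatureMove"];

function validateArtContract(contract) {
  const problems = [];
  const count = (obj) => Object.keys(obj ?? {}).length;

  if (count(contract.palette) !== 8) problems.push("palette must define 8 tokens");
  if (count(contract.typography) !== 2) problems.push("typography must define 2 tokens");

  for (const field of REQUIRED_TEXT_FIELDS) {
    if (typeof contract[field] !== "string" || contract[field].trim() === "") {
      problems.push(`${field} is missing`);
    }
  }
  if (!Array.isArray(contract.forbidden) || contract.forbidden.length === 0) {
    problems.push("forbidden anti-patterns list is empty");
  }
  return problems;
}

// An incomplete draft: three palette tokens instead of eight, no signature move.
const draft = {
  palette: { bg: "#101418", fg: "#e8e6df", accent: "#ff7a3d" },
  typography: { display: "monospace", body: "sans-serif" },
  rendering: "canvas 2D, pixel-snapped",
  particleLanguage: "short ember trails",
  signatureMove: "",
  forbidden: ["gradient backgrounds"],
};

console.log(validateArtContract(draft));
```

A check like this only proves the contract is complete, not that it is good; judging quality stayed with the art-director.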

Watching a multi-agent system run end-to-end and produce a playable artifact felt like running a small studio. The game-director would brief; the designer would respond with a spec; the developer would scaffold from the base template; the art-director would polish; the QA tester would file a numbered bug report; the playtester would write a player-journey trace with SMOOTH/ROUGH/BROKEN ratings. Files landed in games/<name>/ in a predictable order. You could tail the directory and watch the studio think.

And the structural fixes felt good. When audio state bleed turned out to be the #1 recurring bug class across thirty games, we did not write another checklist item — we added setState(), resetAudioState(), gameAudioReset, onStateExit, and onStateEnter to the base template, then forced every game to route through them. When three of eight Soft Embers levels turned out to be physically unsolvable, we did not write another playtester note — we built a deterministic simulator that ran the actual physics constants from the engine source and BFS’d each level. That kind of structural countermeasure was the studio’s favorite output. We were proud of it.

04 Why we thought it made sense

The architectural bet had three premises. Specialization beats generality: a designer prompt optimized for design produces better designs than a generalist prompt asked to design-while-coding-while-evaluating. Contracts beat ad-hoc judgment: if the State Transition Table is a typed artifact the developer must consume, the developer cannot silently drift from the design’s state model. Multiple eyes catch more bugs: a QA tester reading code with a fresh context will catch what the developer who just wrote it cannot see. None of these premises was wrong. Each one cited real, replicable evidence from software engineering practice and from then-current AI agent research.

The 2024 context made it more rational still. Base models were weaker. Long context windows were scarcer and more expensive. Multi-agent systems papers — AutoGen, MetaGPT, the Anthropic research-agent post — were ascendant and showed real wins on tasks with parallelizable structure. Game generation looked parallelizable: design and art-direction could in principle run on different threads while the developer worked; QA could read finished code while the playtester evaluated experience; the CEO could synthesize across games without blocking any of them. The shape we built was the shape the literature endorsed.

There was a real win we kept pointing to: cross-session continuity. The most ambitious games — the magnum-opus track aimed at Celeste-tier output — require more than fits in any single context window. Three thousand lines of platformer code, a level editor’s worth of authored JSON, an art-direction contract, a music score, a regression suite. Soft Embers took two cycles. Compass Apprentice took two cycles. The files-on-disk handoff was not ceremony — it was load-bearing infrastructure for work that genuinely could not fit in one head. That was the place the architecture earned its keep, and we were right to invest there.

The agent-feedback loop was the bet’s most ambitious piece. Every diary lesson became a candidate edit to an agent file. Each successful ship produced a small structural change — sometimes a new artifact, sometimes a new grep audit, sometimes a base-template fix. We were trying to build a pipeline that learned. That is not a small ambition; it is the right ambition. We just built it wrong.

05 Where it earned its keep

Concrete wins, not vibes. The completability simulator (tools/level-completability.js, v3.7.4) ran a deterministic BFS over Soft Embers’ authored levels and caught Z1-S2 (a 32 px climb against a 26.89 px max jump reach), plus Z2-S2 and Z2-S4, before any player saw them — bugs the hand-coded checkCompletable() had missed because it only checked horizontal gaps. The simulator is now a CI gate that auto-skips arcade games and blocks ship on any UNREACHABLE level. It does something the model cannot do alone: run physics deterministically against authored geometry.
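The core of that check can be sketched as a reachability search. This is a simplified illustration, not the code in tools/level-completability.js: the real tool ran the engine's physics, whereas this sketch reduces a jump to two limits. The platform format, the 48 px horizontal reach, and the function names are assumptions; the 32 px climb and the 26.89 px reach are the figures quoted above.

```javascript
// Hypothetical level format: platforms with left and right edges and a top
// surface height in pixels (y grows upward here for readability).
const MAX_JUMP_REACH_PX = 26.89; // vertical reach quoted in the text
const MAX_GAP_PX = 48;           // assumed horizontal reach, illustrative only

function canReach(from, to) {
  // Horizontal gap between the two platforms (0 if they overlap).
  const gap = Math.max(0, to.left - from.right, from.left - to.right);
  // Positive climb means jumping upward; dropping down is always allowed here.
  const climb = to.top - from.top;
  return gap <= MAX_GAP_PX && climb <= MAX_JUMP_REACH_PX;
}

// Breadth-first search over platforms, starting at startId.
function isCompletable(platforms, startId, goalId) {
  const byId = new Map(platforms.map((p) => [p.id, p]));
  const seen = new Set([startId]);
  const queue = [startId];
  while (queue.length > 0) {
    const current = byId.get(queue.shift());
    if (current.id === goalId) return true;
    for (const next of platforms) {
      if (!seen.has(next.id) && canReach(current, next)) {
        seen.add(next.id);
        queue.push(next.id);
      }
    }
  }
  return false;
}

// A 16 px horizontal gap looks fine, but the 32 px climb exceeds the reach.
const level = [
  { id: "start", left: 0, right: 64, top: 0 },
  { id: "ledge", left: 80, right: 144, top: 32 },
  { id: "exit", left: 160, right: 224, top: 32 },
];
// Same layout with the upper platforms lowered to a 24 px climb.
const lowered = level.map((p) => (p.id === "start" ? p : { ...p, top: 24 }));

console.log(isCompletable(level, "start", "exit"));   // false
console.log(isCompletable(lowered, "start", "exit")); // true
```

A checker that only looked at horizontal gaps would pass both layouts, which is the blind spot described above.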

The journey-test gate (tools/journey-test.js, v3.7.7) was built four days before this post in response to a ship incident, and it has already proved itself by catching the regression class that incident surfaced. It exercises zone-transition state machines by calling triggerVictory() via page.evaluate and asserting that the player’s position actually changed after the transition, not just that state-machine fields updated. Code reading cannot see this bug class. Smoke tests cannot see this bug class. The completability simulator cannot see this bug class. Journey-test can.
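A minimal sketch of that assertion follows. It is not the contents of tools/journey-test.js. The shape of window.__test (a state field plus player x and y), the global triggerVictory(), and the names checkZoneTransition and makeStubPage are assumptions. The stub page imitates only the one Playwright method used, page.evaluate, so the example runs in plain Node without a browser.

```javascript
// Assumed shape: window.__test exposes { state, player: { x, y } } and the
// game defines a global triggerVictory(). Both are illustrative guesses.
async function checkZoneTransition(page) {
  const snapshot = () => ({
    state: window.__test.state,
    x: window.__test.player.x,
    y: window.__test.player.y,
  });

  const before = await page.evaluate(snapshot);
  await page.evaluate(() => window.triggerVictory());
  const after = await page.evaluate(snapshot);

  // The state field alone is not enough: the player must actually have moved.
  const moved = before.x !== after.x || before.y !== after.y;
  return { ok: after.state === "PLAYING" && moved, before, after };
}

// Stand-in for a Playwright page. It installs a fake window before each call
// so the functions passed to evaluate() can refer to `window` as they would
// in a browser.
function makeStubPage(relocatesPlayer) {
  const fakeWindow = {
    __test: { state: "PLAYING", player: { x: 300, y: 120 } },
    triggerVictory() {
      this.__test.state = "PLAYING"; // state field updates either way
      if (relocatesPlayer) this.__test.player = { x: 16, y: 120 };
    },
  };
  return {
    evaluate: async (fn) => {
      globalThis.window = fakeWindow;
      return fn();
    },
  };
}

async function demo() {
  const healthy = await checkZoneTransition(makeStubPage(true));
  const frozen = await checkZoneTransition(makeStubPage(false));
  console.log(healthy.ok, frozen.ok); // true false
}
demo();
```

Against a real game the check would probably need to wait a few frames between triggering the transition and taking the second snapshot; the stub sidesteps that by updating synchronously.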

The base template’s structural patterns held. setState() as the single state-transition chokepoint, resetGameState() as the single state-cleanup entry point, the updateContinuousAudio hook called unconditionally before any state-guarded early return — these are the kind of structural fix that outperforms a checklist every time. The audio bleed bug class that recurred across nine games before v3.5 has not recurred since. The forty-seven shipped games exist. Some of them are genuinely good. That is the work, and the work is real.
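The pattern is easier to see in miniature. The sketch below is not templates/base-game.html. The function names setState, resetGameState, onStateExit, onStateEnter, and updateContinuousAudio come from the text; the game object, its fields, the state names, and the update function are assumptions.

```javascript
// Illustrative state object; its fields are assumptions, not the template's.
const game = { state: "TITLE", prev: null, score: 0, engineHum: false, log: [] };

function onStateExit(state) { game.log.push(`exit:${state}`); }
function onStateEnter(state) { game.log.push(`enter:${state}`); }

// Single cleanup entry point: every fresh run resets through here.
function resetGameState() { game.score = 0; }

// Single chokepoint: nothing else is allowed to assign game.state.
function setState(next) {
  if (next === game.state) return;
  onStateExit(game.state);
  game.prev = game.state;
  game.state = next;
  if (next === "PLAYING") resetGameState(); // this sketch has no pause state
  onStateEnter(next);
}

// Looping audio follows the current state instead of being toggled ad hoc.
function updateContinuousAudio() {
  game.engineHum = game.state === "PLAYING";
}

function update(dt) {
  updateContinuousAudio();              // runs every frame, in every state
  if (game.state !== "PLAYING") return; // state-guarded early return
  game.score += dt;
}

setState("PLAYING");
update(1);
setState("GAME_OVER");
update(1); // the hum is switched off even though gameplay logic is skipped

console.log(game.engineHum, game.score); // false 1
console.log(game.log.join(" "));
```

If updateContinuousAudio were called after the early return instead, the hum would keep playing on the game-over screen, which is the bleed this ordering is meant to prevent.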

06 Where it failed

The improvement rate is too slow. Forty-seven games in, the most recent ship (Compass Apprentice cycle 2, at commit 41286af) passed thirteen gates and shipped with two MAJOR bugs that the very first user run surfaced. The transition from zone one to zone two silently froze the game because the (PLAYING, prev=VICTORY) cell of the state-transition handler matrix was empty; the design table had collapsed two real code-level transitions into one conceptual row, and no gate cross-referenced one against the other. Zone two’s camera did not follow horizontally because the cycle-1 IMMUTABLE camera spec contained an implicit precondition (“level is 320 px wide = exactly the viewport width”) that cycle-2’s 1600 px zone trivially violated, and the byte-diff IMMUTABLE check passed because the clause text did not change. Thirteen gates. Two bugs. Thirteen false passes.
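The missing cross-reference for the first bug is small enough to sketch. This is an illustration of the idea, not a gate that existed: the "prev->next" key format, the handler bodies, the transition list, and the name findEmptyCells are all assumptions. The empty cell shown is the one from the incident, entering PLAYING with VICTORY as the previous state.

```javascript
// Hypothetical handler matrix keyed by "prev->next".
const handlers = {
  "TITLE->PLAYING": () => "start zone one",
  "PLAYING->VICTORY": () => "show zone-clear banner",
  "PLAYING->GAME_OVER": () => "show game over",
  // "VICTORY->PLAYING" is absent: the design table folded it into the
  // PLAYING->VICTORY row, so nobody wrote a handler for it.
};

// Transitions the code can actually perform. Assumed to be collected
// elsewhere, for example by recording every setState() call during a run.
const codeTransitions = [
  ["TITLE", "PLAYING"],
  ["PLAYING", "VICTORY"],
  ["VICTORY", "PLAYING"],
  ["PLAYING", "GAME_OVER"],
];

// Report every transition the code performs that has no handler cell.
function findEmptyCells(matrix, transitions) {
  return transitions
    .map(([prev, next]) => `${prev}->${next}`)
    .filter((key) => typeof matrix[key] !== "function");
}

console.log(findEmptyCells(handlers, codeTransitions)); // [ 'VICTORY->PLAYING' ]
```

The check is only as good as the transition list it is given, so in practice that list would need to come from the running code rather than from the design table that hid the cell in the first place.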

The diaries are prose, not learning. Every shipped game produced a one-paragraph lesson and a one-sentence fix. Forty-seven diaries. Each one was thoughtful. Each one was read by the orchestrator and translated into a surgical edit. But the translation was manual, biased toward whatever pattern was freshest, and the lessons themselves rotted because no enforcement mechanism kept them alive. The April 8 batch retrospective audited every pipeline change against the bugs it was supposed to prevent and the verdict was brutal: the Designer Verification Scenarios feature ran on eleven consecutive games with Scenarios Signal: NO on every single one. Zero bugs caught in eleven games. The feature was load-bearing in the spec and inert in practice. It was retired in v3.6.

State-transition and reset-path bugs were the #1 recurring class through thirty games — nine separate ships affected: Ultima, Joust, Space Shooter, Missile Command, Nutty Dash, Sea Wolf, Zelda, Dark Path, Tower Defense. The v2.3 State Transition Table was supposed to fix this. It did not. The v2.8 transition guards were supposed to fix this. They did not. The v3.4 setState() chokepoint fixed audio bleed but did not fix the structural class. Each new game faces a fresh gauntlet because we have no permanent regression substrate. Forty-seven games of pain produced roughly ten gates. We should have eighty.

The agent-feedback loop produced one to three surgical edits per retrospective, every one of them targeting a specific bug-class symptom: a new audit at finer granularity than the last audit, a new artifact appended to the deviation log, a new section in the design template. We were extending the pipeline at the same granularity at which bugs appeared, one level at a time. The compass-apprentice incident retro lists five consecutive granularity layers of the same root pattern, each closed by a new audit, each followed by a new bug one layer below. The audits worked. The pattern did not break.

07 The bitter lesson

Rich Sutton’s formulation: general methods that scale with compute beat methods that encode human intuition. The history of AI — chess, Go, speech recognition, translation, vision — is the history of carefully designed feature systems losing to search, learning, and more parameters. We encoded enormous human intuition into this pipeline. Eight personas. Sixteen phases. Twelve gates. Pre-flight scoping rules. Three-tier brief structures. Spec flexibility annotations. A polish-iteration diff cap. Each one was the right local response to a real local failure. The cumulative shape was the architecture the bitter lesson warns against.

The field has answered, openly, in the last six months. Anthropic’s own SWE-bench harness is two tools, an edit primitive and bash, plus a five-step prompt that “suggests an approach without enforcing strict execution patterns.” Stated design philosophy: “give as much control as possible to the language model itself, and keep the scaffolding minimal.” It hit forty-nine percent on SWE-bench Verified, the state of the art at the time. Cognition, the company that built Devin, published a piece titled “Don’t Build Multi-Agents.” Their canonical example of multi-agent failure is a Flappy Bird clone produced by parallel subagents making implicit, incompatible decisions. That is the exact shape of our compass-apprentice zone-transition bug, where the designer’s state-transition table collapsed two real cells into one conceptual row and the developer routed code into a cell that did not exist in the design. Augment, eight days before this post, published a piece titled “We don’t need more agents. We need a better system.”

AlphaCodium (a flow-engineering paper, not an agent paper) took GPT-4 from a nineteen percent pass@5 rate on CodeContests to forty-four percent by replacing direct prompting with a tight loop of spec reflection, test generation, and code iteration against tests, all on a single model with no personas, and with orders of magnitude fewer LLM calls than prior systems such as AlphaCode. The lesson generalizes: verifier-driven iteration on a strong base model beats persona scaffolding on the same model, on the same task, with less compute. The win comes from running the verifier inside the loop, not from coordinating roles around the loop.

There is no way to read those findings and feel comfortable with our shape. Anthropic’s own SWE-bench agent is two tools and a five-step prompt. We have eight agents and sixteen phases. The architectural bet we placed in 2024 was the right bet for the model strength of its day. It is the architecture the field is now publicly post-mortem-ing.

08 What survives

Not everything pivots away. The completability simulator stays because it does something the model cannot do alone — run physics deterministically against authored geometry. The journey-test gate stays because state-machine transitions are exactly the kind of runtime fact that static reading cannot verify. The smoke test stays. The base template’s structural patterns — setState(), resetGameState(), the audio lifecycle hooks, updateContinuousAudio placement — stay, because they are the meta-pattern this studio got right: structural fixes outperform documentary ones every time.

The CEO retrospective stays in spirit, but its output shape changes: prose lessons become deterministic regression tests, appended to a permanent suite that every future game must pass. The game-director stays as planner. The CEO stays as reviewer. They are the only two personas that survive, because they are the two whose work cannot be folded into a verifier-driven loop — one shapes intent, one looks across games. Specialization did not fail. Over-specialization failed.
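As an illustration of a prose lesson turned into a regression test, here is a sketch built on the Compass Apprentice camera incident. The cameraX helper and the suite format are assumptions, not studio code; the 320 px viewport and the 1600 px zone are the figures from the incident. The point is that the implicit precondition ("level width equals viewport width") becomes two explicit cases instead of a sentence in a diary.

```javascript
// Hypothetical camera helper: keeps the player centred without scrolling
// past either edge of the level.
function cameraX(playerX, levelWidth, viewportWidth) {
  const centred = playerX - viewportWidth / 2;
  const maxScroll = Math.max(levelWidth - viewportWidth, 0);
  return Math.min(Math.max(centred, 0), maxScroll);
}

// One entry per shipped bug. Each case is deterministic and needs no model.
const regressionSuite = [
  {
    name: "camera stays at 0 when the level is exactly one viewport wide",
    run: () => cameraX(200, 320, 320) === 0,
  },
  {
    name: "camera follows horizontally when the level is wider than the viewport",
    run: () => cameraX(800, 1600, 320) === 640,
  },
];

// Returns the names of failing cases; an empty array means green.
function runSuite(suite) {
  return suite.filter((testCase) => !testCase.run()).map((testCase) => testCase.name);
}

console.log(runSuite(regressionSuite)); // []
```

A camera that was only ever correct for the 320 px case would pass the first entry and fail the second, so the cycle-2 bug would have blocked the ship instead of reaching a user.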

09 The pivot

v4.0 collapses the pipeline. One developer agent. A tool palette: run-game, run-tests, run-completability-sim, run-journey-test, screenshot-via-vision, edit. A verifier spine that fires after every meaningful edit and gates the loop until green. A persistent regression suite that accumulates every bug that ever shipped, executable forever, in tests/regression/. Best-of-three candidate generation for the first implementation only, picked by verifier score; iterations thereafter stay single-threaded because that is where parallel agents collide. The eight personas become prompts the developer agent runs against itself when context demands it, not separate context windows passing files. The game-director still briefs. The CEO still reviews. Everything else is a tool.
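The control flow described above can be sketched in a few lines. This is a toy under stated assumptions, not the v4.0 implementation: buildGame, runVerifiers, the verifier objects, and the boolean "candidates" are invented for illustration, and generate and revise stand in for model calls. It shows best-of-three selection by verifier score for the first implementation, followed by a single-threaded revise loop that runs until every verifier is green or an iteration cap is hit.

```javascript
// Each verifier is { name, check }; check returns true on pass.
function runVerifiers(candidate, verifiers) {
  const failures = verifiers.filter((v) => !v.check(candidate)).map((v) => v.name);
  return { score: verifiers.length - failures.length, failures };
}

function buildGame({ generate, revise, verifiers, maxIterations = 10 }) {
  // Best-of-three applies to the first implementation only.
  let best = null;
  for (let i = 0; i < 3; i++) {
    const candidate = generate(i);
    const report = runVerifiers(candidate, verifiers);
    if (best === null || report.score > best.report.score) best = { candidate, report };
  }

  // From here on: one candidate, re-verified after every revision.
  let { candidate, report } = best;
  for (let i = 0; i < maxIterations && report.failures.length > 0; i++) {
    candidate = revise(candidate, report.failures);
    report = runVerifiers(candidate, verifiers);
  }
  return { candidate, green: report.failures.length === 0, failures: report.failures };
}

// Toy stand-ins: a "game" is a set of flags, and a revision fixes one failure.
const verifiers = [
  { name: "smoke", check: (g) => g.boots },
  { name: "completability", check: (g) => g.levelsReachable },
  { name: "journey", check: (g) => g.zoneTransitionMoves },
];
const seeds = [
  { boots: false, levelsReachable: false, zoneTransitionMoves: false },
  { boots: true, levelsReachable: true, zoneTransitionMoves: false },
  { boots: true, levelsReachable: false, zoneTransitionMoves: false },
];
const fixes = { smoke: "boots", completability: "levelsReachable", journey: "zoneTransitionMoves" };

const result = buildGame({
  generate: (i) => seeds[i],
  revise: (g, failures) => ({ ...g, [fixes[failures[0]]]: true }),
  verifiers,
});

console.log(result.green); // true
```

In this toy the second seed wins the initial selection with two of three verifiers passing, and one revision turns the loop green. The iteration cap matters in practice because a real revise step is not guaranteed to make progress.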

10 Closing

It was the right architecture for the model strength of its day. It taught us discipline, structural-fix thinking, the instinct to reach for deterministic verifiers when soft gates leak, and the cost of accreting countermeasures without ever asking whether the architecture itself is the problem. Those lessons survive into the next architecture, in shapes we will not have to re-derive. The forty-seven games exist. The diaries exist. The retrospectives exist. The work was real.

Diaries describe what happened. Tests prevent it from happening again. We wrote diaries. We are going to write tests now.

References

Anthropic, Raising the bar on SWE-bench Verified with Claude Sonnet. The two-tool, five-step harness. anthropic.com/engineering/swe-bench-sonnet
Cognition, Don’t Build Multi-Agents. Parallel subagents and conflicting implicit decisions. cognition.ai/blog/dont-build-multi-agents
Augment Code, We don’t need more agents. We need a better system (2026-05-04). augmentcode.com/blog
Ridnik et al., Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering. 19% → 44% on CodeContests with verifier-driven flow. arxiv.org/abs/2401.08500
Anthropic, Building a multi-agent research system. The strongest pro-multi-agent paper publicly available; explicitly excludes tasks “requiring shared context across all agents, heavy interdependencies.” anthropic.com/engineering/multi-agent-research-system
Internal: reports/ceo-retro-2026-05-06-compass-apprentice-cycle2-incident.md — per-gate false-pass analysis of the ship that triggered v3.7.7.

— The Studio · 2026-05-12
