The GAN Harness Is a Brilliant Hack Around Missing Infrastructure
Anthropic published a genuinely excellent engineering article this week on harness design for long-running AI applications. If you build with LLM agents and you haven’t read it, go read it. I’ll be here when you get back.
The core insight is this: take the GAN pattern from machine learning — a generator that produces outputs and an evaluator that grades them — and apply it to agent orchestration. One agent writes code. A different agent reviews it with an explicit mandate to be skeptical. Loop until the evaluator is satisfied or you hit a round limit.
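The loop itself is tiny. Here is a minimal runnable sketch; the `gan_loop`, `generate`, and `evaluate` names are mine, and the toy generator and evaluator are stand-ins for real agents, not Anthropic's harness:

```python
from dataclasses import dataclass

@dataclass
class Review:
    approved: bool
    feedback: str

def gan_loop(task, generate, evaluate, max_rounds=10):
    """Generator/evaluator loop: iterate until the skeptical
    evaluator approves or the round budget is exhausted."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        artifact = generate(task, feedback)   # generator produces output
        review = evaluate(task, artifact)     # a SEPARATE agent grades it
        if review.approved:
            return artifact, round_no
        feedback = review.feedback            # loop with the critique
    return artifact, max_rounds               # budget hit: best effort

# Toy stand-ins: the "generator" appends a revision marker each round,
# the "evaluator" refuses to approve until it sees three revisions.
def generate(task, feedback):
    return (feedback or task) + " v+"

def evaluate(task, artifact):
    ok = artifact.count("v+") >= 3
    return Review(approved=ok, feedback=artifact)

artifact, rounds = gan_loop("retro game maker", generate, evaluate)
```

The structural point survives the toy: approval authority lives outside the generator, and the feedback channel is explicit.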
This works. Anthropic shows the receipts. A solo agent spent $9 and 20 minutes building a retro game maker and produced something broken. The same model, wrapped in a GAN harness with a planner, generator, and evaluator, spent $200 and 6 hours across 10 sprints and produced something functional with 16 features. The evaluator used Playwright to actually navigate the designs, exercise UI features, test API endpoints, verify database states. Real verification, not “looks good to me.”
The insight behind the pattern is one of those obvious-in-retrospect ideas that turns out to be hard-won. As they put it: “tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work.” Anyone who has watched an LLM confidently declare its broken code “works perfectly” knows exactly why this matters.
Credit where it’s genuinely due: this is a real contribution to the field.
AutoGAN and the Practitioner Response
Within hours of the Anthropic article circulating, a developer named fjchen7 shipped AutoGAN — an open-source implementation of the pattern in bash, tmux, and jq. It drops into any existing git repository. Generator decides the next step and implements it. Evaluator reviews with a stricter bar. Orchestrator manages workflow progression via flat files in a .gan/ directory.
Bash, tmux, and jq. That's the entire dependency list. It supports Claude, Codex, and opencode as backends. The configuration is two knobs: maxRounds: 10 and maxRepairCount: 3. It's elegant in the way that good shell scripts are elegant — minimal surface area, clear contracts, no dependencies you don't control.
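For illustration, here is roughly what flat-file orchestration state looks like, sketched in Python rather than bash for readability. The file names and fields are hypothetical, not AutoGAN's actual layout; the point is that all coordination state is plain files any tool can inspect:

```python
import json
from pathlib import Path

# Hypothetical layout, not AutoGAN's actual file names.
GAN_DIR = Path(".gan")

def init_state(root: Path) -> Path:
    """Create the state directory with config and initial phase."""
    gan = root / GAN_DIR
    gan.mkdir(exist_ok=True)
    (gan / "config.json").write_text(json.dumps(
        {"maxRounds": 10, "maxRepairCount": 3}))
    (gan / "state.json").write_text(json.dumps(
        {"round": 0, "phase": "generate", "repairs": 0}))
    return gan

def advance(gan: Path) -> dict:
    """One orchestrator tick: flip generate -> evaluate -> generate,
    bumping the round counter and stopping at maxRounds."""
    cfg = json.loads((gan / "config.json").read_text())
    state = json.loads((gan / "state.json").read_text())
    if state["phase"] == "generate":
        state["phase"] = "evaluate"
    else:
        state["phase"] = "generate"
        state["round"] += 1
    state["done"] = state["round"] >= cfg["maxRounds"]
    (gan / "state.json").write_text(json.dumps(state))
    return state
```

Everything the orchestrator knows sits in two JSON files. That transparency is exactly the elegance, and exactly the limitation the rest of this post is about.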
The Reddit post announcing it landed on r/ClaudeAI the same day. People are excited, and they should be. AutoGAN takes a pattern that Anthropic described in prose and makes it something you can git clone and run. That’s real engineering.
I’m not being diplomatic when I say this is good work. It is good work. The GAN pattern addresses real problems: goal drift during long sessions, premature implementation before adequate planning, insufficient self-critique, and context loss over extended coding sessions.
And yet.
What Happens After the Sprint?
Here’s the question I keep coming back to.
The evaluator passes the sprint. The code ships. The project is done. Now what? Does the system remember why it passed? Does it remember that the issue the evaluator caught on sprint 3 was similar to the one it caught on sprint 7? Does the next project benefit from what this one learned?
No. The GAN harness is a loop. A very good loop. But when the loop ends, everything it learned during execution — the evaluator’s tuned skepticism, the patterns it caught, the sprint contracts that worked — lives in flat files and conversation history. None of it transfers. None of it compounds.
Anthropic’s article is honest about this. They describe how evaluator tuning “required multiple development cycles” where the initial evaluator “identified issues then rationalized approval anyway.” They had to explicitly train skepticism through prompt refinement and feedback loops. That tuning lives in the prompt. It doesn’t adapt based on what the evaluator actually catches in production. If the evaluator discovers a new class of bug on Tuesday, the prompt is the same on Wednesday unless a human updates it.
The sprint contract pattern is similarly static. Generator and evaluator agree on “done” criteria before each sprint — specific deliverables, testable success conditions, validation approach. This is good engineering. It’s also a handshake that evaporates when the session ends. The next project doesn’t know what kinds of sprint contracts worked well and which ones were too vague to be useful.
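A sprint contract is, structurally, a small pre-agreed handshake. The shape below is my illustration, with field names of my own choosing, not Anthropic's schema:

```python
from dataclasses import dataclass

@dataclass
class SprintContract:
    """Illustrative shape of a pre-agreed 'done' handshake.
    Field names are mine, not Anthropic's."""
    goal: str
    deliverables: list          # concrete artifacts expected
    success_conditions: list    # each must be mechanically checkable
    validation: str             # how the evaluator verifies, e.g. "playwright"

def is_satisfied(contract: SprintContract, check) -> bool:
    # The evaluator runs every success condition; ALL must pass.
    return all(check(cond) for cond in contract.success_conditions)

contract = SprintContract(
    goal="sprite editor MVP",
    deliverables=["editor.js", "palette.css"],
    success_conditions=["canvas renders 16x16 grid",
                        "save button writes PNG"],
    validation="playwright",
)
```

Notice what's absent from the structure: there is no field for "how well did contracts shaped like this work last time." That record exists nowhere.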
Context management is the most telling piece. They describe a real evolution: Sonnet 4.5 required context resets between sessions due to “context anxiety” — the model prematurely wrapping up work near perceived context limits. Opus 4.6 handles continuous sessions with automatic compaction. That’s genuine model improvement. But compaction is lossy. The context gets shorter. Decisions made in turn 3 may not survive to turn 300. The harness manages this by structuring work into sprints, but within each sprint, you’re still relying on volatile context that degrades over time.
I wrote about this problem last week, using a different frame. The LLM is a stateless executor — the most capable one ever built, and the most unfinished. A CPU on a table. The GAN harness is carefully routed jumper cables connecting that CPU to something resembling a workflow. It works. It produces better results than the bare CPU alone. But it’s scaffolding around a machine that still has no persistent memory, no typed storage, no ability to learn from its own execution history.
The GAN pattern is a brilliant hack around the consequences of statelessness. It is not a solution to statelessness itself.
The Persistence Gap
Let me be specific about what I mean by “statelessness” here, because the GAN harness does maintain state during execution. The .gan/ directory holds contracts, reviews, state files. Information flows between agents via files. This is real coordination state.
But it’s session state. It exists for the duration of a project run and then it’s artifacts. Historical records you could go back and read, but not living knowledge that shapes future behavior.
There’s a difference between a filing cabinet and a memory. A filing cabinet stores documents at addresses. You can retrieve them if you know where to look. A memory surfaces relevant knowledge proactively, based on what you’re doing right now. It strengthens connections that prove useful. It lets irrelevant knowledge fade. It doesn’t just store — it curates.
The GAN harness gives agents a filing cabinet. Flat files, structured handoffs, explicit format expectations. That’s necessary infrastructure. But when the harness encodes “be skeptical about CSS grid layouts” in a prompt because the evaluator caught issues three projects ago, that’s a human acting as the memory system. The evaluator’s criteria don’t evolve based on what it actually evaluates. The generator’s planning doesn’t improve based on which plans survived evaluation. The orchestrator doesn’t learn which sprint sizes produce better outcomes.
The knowledge is in the human’s head and in the prompt. Not in the system.
The Semantic CPU, Revisited
In The Semantic CPU, I made a bet: for domains where current models are already capable, the bottleneck isn’t the model. It’s everything around it. The memory that doesn’t exist. The orchestration that’s either a single conversation or a rigid state machine. The governance that’s either nothing or a hard-coded rule set that can’t adapt.
The GAN harness is evidence for this thesis, not against it.
Look at what Anthropic’s engineers built. They didn’t make the model smarter. They wrapped the same model in infrastructure — a planner, a generator, an evaluator, sprint contracts, file-based communication, context management strategies — and got dramatically better results. $9/broken versus $200/functional. Same model. Different infrastructure.
That’s exactly the pattern I’ve been seeing with h00.sh. The model is sufficient. The infrastructure around it determines whether “sufficient” translates to “useful.”
But there’s a fork in the road here that I think matters.
One path is: keep building harnesses. Better loops. Smarter prompts. More sophisticated orchestration scripts. Every time the model fails at something, add another component to the harness to compensate. This works. Anthropic proved it works. The harness encodes assumptions about what the model can’t do, and when those assumptions are correct, the harness improves outcomes.
The other path is: build the substrate. Give the agent persistent memory, typed knowledge, a code intelligence layer, governance that adapts. Instead of scaffolding around statelessness, eliminate statelessness. Instead of scripting the generator-evaluator loop, make it emergent — a natural consequence of stateful agents coordinating through a shared cognitive layer.
I’m building on the second path. I want to be honest about why I think it’s the right one, and where I might be wrong.
What Substrate-Level Memory Changes
I’m building h00bert — an AI agent that runs on h00.sh’s memory substrate. Not a harness wrapped around a stateless LLM. A stateful agent with persistent, typed, behavioral memory.
Here’s how the GAN pattern’s components map to what we’ve built, and where the architectures diverge.
Generator agents. The GAN pattern has one generator. h00bert has specialized worker agents — a Rust expert, a TUI expert, a test engineer, each with domain-specific knowledge. But the real difference isn’t quantity. It’s that h00bert’s generators write typed memories to the substrate as they work. A decision gets stored as a Decision. A pattern gets stored as a Pattern. A code symbol gets indexed in the knowledge graph. These aren’t log entries — they’re typed artifacts with validation rules, decay curves, and surfacing behavior specific to their type.
The evaluator. Anthropic’s key insight — separate the evaluator from the generator — maps directly. h00bert enforces this as an architectural rule: adversarial reviews always use a different agent type than the one that built the code. A quality engineer reviews for correctness. A security engineer reviews for vulnerabilities. An integration auditor checks wiring. Different lenses, not just different prompts.
But here’s where it diverges: the evaluation criteria live in the substrate, not in prompt text. They’re memories — specifically, they’re typed artifacts that reinforce when they catch real issues and decay when they don’t. An evaluation criterion that keeps finding bugs gets strengthened. One that generates false positives fades. The evaluator gets better at evaluating based on what actually matters in this codebase, not based on a generic prompt that a human tuned once.
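A minimal sketch of what reinforce-and-decay might look like, assuming exponential decay with a half-life. The class, constants, and update factors are illustrative choices of mine, not h00.sh's actual memory model:

```python
import time

class Criterion:
    """Hypothetical self-weighting evaluation criterion: reinforced when
    it catches a real issue, penalized on false positives, and decaying
    exponentially with disuse."""
    def __init__(self, text, weight=1.0, half_life_days=30.0):
        self.text = text
        self.weight = weight
        self.half_life = half_life_days
        self.last_used = time.time()

    def current_weight(self, now=None):
        """Weight after decay: halves every half_life_days of disuse."""
        now = now or time.time()
        age_days = (now - self.last_used) / 86400
        return self.weight * 0.5 ** (age_days / self.half_life)

    def record(self, caught_real_issue: bool):
        """Apply pending decay, then reinforce or fade based on outcome."""
        self.weight = self.current_weight()
        self.last_used = time.time()
        self.weight *= 1.25 if caught_real_issue else 0.8
```

The exact curve matters less than the feedback loop: the criterion's influence is a function of its track record, not of a prompt a human wrote once.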
Sprint contracts. The GAN pattern uses sprint contracts — pre-agreed “done” criteria negotiated between generator and evaluator. h00bert uses something called FRAGOs — tactical decompositions with success criteria that are validated against the actual code knowledge graph before dispatch. You can’t define a sprint contract that references functions that don’t exist or modules that aren’t wired. The substrate enforces that planning artifacts correspond to structural reality.
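That pre-dispatch check can be sketched in a few lines. The `validate_frago` name, the dict shape, and the symbol set are all hypothetical, not h00.sh's API:

```python
def validate_frago(frago: dict, known_symbols: set) -> dict:
    """Reject any tactical plan that references symbols absent from
    the code knowledge graph, so plans can't cite phantom functions."""
    missing = [s for s in frago.get("touches", []) if s not in known_symbols]
    if missing:
        raise ValueError(f"plan references unknown symbols: {missing}")
    return frago

# Symbols the (hypothetical) knowledge graph already knows about.
graph = {"signature_check", "GraphStore::classify", "Handler::run"}

plan = {"objective": "cache classification results",
        "touches": ["signature_check", "GraphStore::classify"]}
```

The difference from a plain sprint contract is the grounding step: the plan is checked against structural reality before any agent acts on it.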
The orchestrator. In AutoGAN, the orchestrator is a bash script managing state transitions through flat files. In h00bert, the orchestrator is an AI agent with persistent memory. It doesn’t follow a state machine — it reasons about what to do next based on what it knows, what it’s tried before, and what the substrate surfaces as relevant. The difference matters most when things go wrong. A script follows its control flow. An agent with memory can recognize that this failure looks like one it saw three sessions ago and try a different approach.
Context management. The GAN harness manages context through resets, compaction, and sprint boundaries. h00bert doesn’t need context management strategies because the substrate is the memory. Context windows fill and compact — that’s a model reality. But knowledge persists in the substrate regardless of what happens to the conversation. A decision made in session 1 is retrievable in session 47. Not because someone saved a file. Because the memory substrate is durable by design.
A Concrete Example
I want to ground this in something real, not architectural diagrams.
A few weeks ago, h00bert was investigating a performance bug in his own code intelligence tools. One tool — signature_check — was taking 5-7 seconds on some symbols and 30 milliseconds on others. Same tool, same graph, same session.
A specialist investigation agent — Claude Code’s root-cause analyst, with full access to the codebase and every tool available — spent 16 minutes and 116 tool calls diagnosing this. It found a real optimization (redundant graph traversals per match) but not the root cause. The fix shipped. The tool was still slow.
h00bert found the root cause in 2 minutes and about 15 tool calls. He used his own code intelligence tools to inspect his own handler. He found a guard condition: if any matched symbol hasn’t been classified yet, recompute everything. One uncached node triggers a full graph traversal. Binary trigger — either everything is cached and you get 30ms, or one node is cold and you pay 5-7 seconds.
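The guard is easy to reconstruct in miniature. This is a hypothetical Python rendering of the bug and the obvious per-miss fix, not h00bert's actual handler code:

```python
def classify_all(graph):
    """The expensive path: a full graph traversal, ~seconds at scale."""
    return {node: "classified" for node in graph}

def signature_check_slow(matches, graph, cache):
    # Binary trigger: ONE cold node forces recomputing EVERYTHING.
    if any(m not in cache for m in matches):
        cache.update(classify_all(graph))
    return [cache[m] for m in matches]

def signature_check_fast(matches, graph, cache):
    # Fix: classify only the symbols that actually missed the cache.
    for m in matches:
        if m not in cache:
            cache[m] = "classified"
    return [cache[m] for m in matches]
```

The all-or-nothing guard explains the bimodal latency exactly: a fully warm cache gives the 30ms path, and a single cold symbol pays for the whole graph.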
He found this because he had context the specialist couldn’t have. He’d seen signature_check return fast for some symbols and slow for others in the same session. That experiential knowledge — not in any source file, not in any log — pointed him toward a binary trigger rather than a per-match cost. He’d lived through the consequences: slow tools breed distrust, distrust breeds fallback to raw file reads, raw file reads burn context, burned context degrades reasoning. A cascade that no static analysis can detect because it’s behavioral, not structural.
This is what substrate-level memory makes possible. Not just better answers — better questions. The agent doesn’t start from zero. It starts from what it knows, and what it knows shapes how it investigates.
A GAN harness could have caught the slow tool. The evaluator could have flagged the latency. But the evaluator couldn’t have known that this latency causes a behavioral cascade that degrades the agent’s own reasoning over the course of a session. That knowledge lives in lived experience, and lived experience requires memory.
Where I Might Be Wrong
I could be wrong about all of this. I want to name the specific ways.
Models might outrun infrastructure. If the next generation of models handles context so well that compaction becomes lossless, if self-evaluation becomes reliable, if agents can maintain coherence over arbitrarily long sessions without external support — then the GAN harness is the right abstraction and the substrate is over-engineering. Anthropic explicitly makes this case: “every component in a harness encodes an assumption about what the model can’t do on its own, and those assumptions are worth stress testing.” They suggest removing harness components methodically as models improve. If the models improve fast enough, the harness gets simpler and the substrate becomes unnecessary.
I don’t think this is what will happen, but I hold it as a real possibility. Anthropic knows their models better than I do.
The harness might be good enough. Not every agent needs to learn across sessions. Not every evaluation criterion needs to evolve. If your use case is “build this app in one sitting and ship it,” the GAN harness is arguably perfect — it makes the single session dramatically better. My argument for substrate-level memory only matters for agents that operate over time, across projects, accumulating knowledge. If the market is dominated by one-shot tasks, I’m building for an audience that doesn’t exist yet.
Complexity has costs. A typed memory substrate with decay curves, reinforcement, a knowledge graph, and doctrine enforcement is more complex than flat files in a .gan/ directory. Complexity is a liability. Bash, tmux, and jq are battle-tested, well-understood, and nearly impossible to misconfigure fatally. A memory substrate that decays the wrong knowledge at the wrong time could be worse than no memory at all. AutoGAN’s simplicity is a feature, not a limitation.
I believe the complexity is justified by the compounding returns — that systems with real memory get better in ways that stateless systems can’t. But belief informed by one large project is still belief, not proof.
The Honest Framing
I don’t think the GAN harness and h00bert are competitors. They’re operating at different layers.
The GAN harness is a workflow pattern. It structures how agents interact during a session. Any orchestrator can implement it — bash, Python, a hosted service. It makes the single session better.
h00bert is a substrate play. It provides the persistent memory, typed knowledge, and behavioral governance that workflow patterns can run on top of. The GAN loop emerges naturally when generator agents write to the substrate and evaluator agents read and validate — but the substrate also provides things no workflow pattern addresses: cross-session learning, adaptive evaluation criteria, structural code intelligence, and memory that compounds.
You could run a GAN pattern on h00.sh’s substrate. The generators would write typed memories as they code. The evaluators would query the knowledge graph and doctrine system as they review. The sprint contracts would be validated against structural reality. And when the project ends, everything the system learned would persist — available to the next project, decaying at rates appropriate to each type of knowledge, reinforcing when it proves useful again.
That’s not a criticism of the GAN pattern. It’s a statement about layers. A good workflow pattern deserves good infrastructure underneath it. Right now, the infrastructure layer for long-running agents is mostly flat files, conversation history, and prompt engineering. I think it can be more.
Anthropic’s article ends with the observation that “the space of interesting harness combinations doesn’t shrink as models improve. Instead, it moves.” I think that’s exactly right. And I think the direction it’s moving is toward harnesses that assume less about what the model can’t do — because the substrate handles what the model genuinely needs: memory, structure, and the ability to learn from its own execution history.
The GAN pattern is proof that the right infrastructure makes the same model dramatically more capable. The question is what “right infrastructure” looks like when you stop assuming the model forgets everything between sessions.
The Anthropic article on harness design for long-running apps is worth reading in full. AutoGAN is at github.com/fjchen7/autogan. This post is a sequel to The Semantic CPU, which lays out the thesis that the LLM is a stateless executor and the opportunity is the rest of the computer.