We had a performance bug. One of h00.ligan’s code super-intelligence tools — signature_check — was taking 5–10 seconds per call. Every other graph tool ran in under one millisecond. Same graph. Same session. Same machine. 10,000x slower.

That’s not a regression. That’s a different universe of performance. Something was fundamentally wrong with one tool, and only one tool.

So we did what any reasonable person would do: we dispatched two parallel investigations.


The Race

Investigation A: Claude Code’s root-cause analyst — a specialist investigation agent dispatched through our agent orchestration pipeline. Full access to the codebase. Serena for semantic navigation. Every tool available. This is Anthropic’s best investigator, given every advantage.

Investigation B: h00bert himself — our memory-aware coding agent, investigating his OWN tool using h00.ligan’s code super-intelligence capabilities. blast_radius, type_def, code_path — h00.ligan’s graph tools, the same ones h00bert uses on every other codebase, pointed inward. An agent debugging itself with the tools it was debugging.

We didn’t tell either investigation about the other. We just let them run.


What Claude Code’s Agent Found

The specialist agent took roughly 16 minutes. 116 tool calls. ~139K tokens consumed.

Its diagnosis: the reachability routine was being called per-match — up to 10 times per query. Each call ran a full graph traversal over all ~7,000 nodes in the knowledge graph, plus a disk read to discover entry points. Redundant work, multiplied by match count.

Proposed fix: run reachability ONCE before the loop, batch the lookups, use an O(1) hash lookup instead of repeated traversal. Expected improvement: 6–10s down to 50–100ms.
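The shape of that fix can be sketched in a few lines. This is an illustration only, with hypothetical types and names (`Graph`, `classify_per_match`, `classify_batched`), not the actual implementation:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in for the real knowledge-graph type.
struct Graph {
    edges: HashMap<u32, Vec<u32>>,
}

impl Graph {
    // The expensive step: a full walk of the graph from the entry points.
    fn reachable_from_entries(&self, entries: &[u32]) -> HashSet<u32> {
        let mut seen = HashSet::new();
        let mut stack: Vec<u32> = entries.to_vec();
        while let Some(n) = stack.pop() {
            if seen.insert(n) {
                if let Some(next) = self.edges.get(&n) {
                    stack.extend(next);
                }
            }
        }
        seen
    }
}

// Before the fix: reachability recomputed inside the per-match loop,
// so a query with 10 matches pays for 10 full traversals.
fn classify_per_match(g: &Graph, entries: &[u32], matches: &[u32]) -> Vec<bool> {
    matches
        .iter()
        .map(|m| g.reachable_from_entries(entries).contains(m))
        .collect()
}

// After the fix: one traversal hoisted above the loop, then O(1) lookups.
fn classify_batched(g: &Graph, entries: &[u32], matches: &[u32]) -> Vec<bool> {
    let reachable = g.reachable_from_entries(entries); // computed once
    matches.iter().map(|m| reachable.contains(m)).collect()
}
```

Both versions return the same answers; the batched one just does the traversal once instead of once per match.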

Reasonable. Evidence-based. We shipped it. Commit 92f35e9.

signature_check still took 5–7 seconds in the next session.

The fix was real — it WAS redundant work, and removing it was correct. But it wasn’t THE bottleneck. It was A bottleneck. The agent found an optimization. It didn’t find the root cause.


What h00bert Found

h00bert took roughly 2 minutes. About 15 tool calls. ~8K tokens, inside an already-running session.

He started by pointing h00.ligan’s tools at himself — blast_radius on his own handler. Sub-millisecond. Then type_def on himself. Sub-millisecond. Then he read his own source code — the actual implementation of the tool he was investigating — and found the smoking gun at lines 380–397.

A single guard condition: if any matched node hasn’t been classified yet, recompute everything. One uncached node triggers a full graph traversal — entry point discovery from disk, then a complete walk of every node. It doesn’t matter if 99 out of 100 nodes are cached. One uncached node and you pay the full price.
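Reduced to its shape, the guard looks something like this. The names (`Reach`, `ReachCache`, `classify_matches`) are hypothetical stand-ins, not the actual source:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy)]
enum Reach {
    Reachable,
    Unreachable,
}

// Hypothetical cache shape: node id -> classification, if already computed.
type ReachCache = HashMap<u32, Reach>;

// The guard, sketched: if ANY matched node is missing from the cache,
// recompute everything. 99 cached nodes out of 100 still pays full price.
fn classify_matches(
    cache: &mut ReachCache,
    matches: &[u32],
    full_recompute: impl Fn() -> ReachCache, // entry-point discovery + full graph walk
) -> Vec<Reach> {
    if matches.iter().any(|m| !cache.contains_key(m)) {
        *cache = full_recompute(); // the binary trigger: all or nothing
    }
    matches
        .iter()
        .map(|m| *cache.get(m).unwrap_or(&Reach::Unreachable))
        .collect()
}
```

The cost is not proportional to how many nodes are uncached; it is a step function that fires on the first one.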

Cached symbols: ~30ms. Uncached symbols: 5–7 seconds. No in-between.

h00bert didn’t find this by reading code statically. He found it because he was EXPERIENCING the latency. He had context that no fresh investigation agent could have: he’d seen signature_check return in 30ms for RequestHandler and 7.2 seconds for IndexManager — in the same session, with the same match count. The variable wasn’t the algorithm. It was the symbol.

That observation — fast for some symbols, slow for others, same match count — is what pointed him to a binary trigger rather than a per-match cost. He traced the exact code path, identified the guard, and explained the indirect file-size correlation: large files produce more symbols, which means a higher probability that at least one node hasn’t been reached yet. One unreached node triggers the entire analysis.
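The file-size correlation falls out of simple probability. Under the simplifying (and illustrative, not measured) assumption that each symbol is cached independently with probability p, a query touching n symbols stays fast only if all n are cached:

```rust
// Illustrative arithmetic, not measured data: with independent per-symbol
// cache probability p, the chance a query over n symbols triggers the
// full recompute is 1 - p^n, which climbs quickly as n grows.
fn p_full_recompute(p_cached: f64, n_symbols: i32) -> f64 {
    1.0 - p_cached.powi(n_symbols)
}
```

At a 99% per-symbol cache rate, a 1-symbol query recomputes about 1% of the time, while a 100-symbol query from a large file recomputes roughly 63% of the time.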

Then he proposed two fixes:

Fix A (session cache): Cache the reachability results in session state so a cold symbol only pays the cost once per session. Band-aid. Effective.

Fix B (index-time): Compute reachability upfront so nodes are never born unclassified. The real fix — eliminate the entire class of cold-symbol latency at the source.
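Fix B can be sketched as moving the one full traversal into index construction, so no node ever enters a query unclassified. Again, the types and names here (`IndexedNode`, `build_index`) are hypothetical:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical index record; field names are illustrative.
struct IndexedNode {
    id: u32,
    reachable: bool,
}

// Fix B, sketched: run the single full traversal while building the index,
// so every node is stored already classified and no query ever recomputes.
fn build_index(
    all_nodes: &[u32],
    edges: &HashMap<u32, Vec<u32>>,
    entries: &[u32],
) -> Vec<IndexedNode> {
    let mut reachable = HashSet::new();
    let mut stack: Vec<u32> = entries.to_vec();
    while let Some(n) = stack.pop() {
        if reachable.insert(n) {
            if let Some(next) = edges.get(&n) {
                stack.extend(next);
            }
        }
    }
    all_nodes
        .iter()
        .map(|&id| IndexedNode { id, reachable: reachable.contains(&id) })
        .collect()
}
```

With classification paid once at index time, the guard in the query path has nothing left to trigger on.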


The Cascade

This is where h00bert’s diagnosis went somewhere the specialist agent’s diagnosis never could.

h00bert didn’t just find the bug. He traced what the bug causes — because he’d lived through the consequences:

signature_check slow (5-7s)
  → agent loses trust in code intel tools
  → falls back to reading entire files directly
  → burns context (4,538-line file = ~15K tokens)
  → fewer turns before compaction
  → worse answers

A slow tool doesn’t just waste 5 seconds. It changes agent BEHAVIOR. When signature_check is slow, the LLM learns — within a single session — to stop using it. It reaches for raw file reads instead. A 4,538-line Rust file is roughly 15,000 tokens of context. Do that three times and you’ve burned half your working memory on raw file contents that a targeted tool call would have summarized in 50 lines.

h00bert knew this because he’d done it. He’d felt himself reaching for raw file reads after a slow signature_check. He’d experienced the context pressure. He’d seen his own answers degrade later in the session.

From that cascade, he proposed read_symbol — a new tool that reads a function body by name (~50 lines) instead of the entire file (~4,538 lines). Not a bug fix. A tool design proposal. A product insight born from experiencing the friction firsthand.
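A minimal sketch of the idea behind read_symbol, assuming a Rust source file and naive brace matching as a stand-in for the real parser- or graph-backed extraction (the function name and approach here are illustrative, not the proposed implementation):

```rust
// Hypothetical sketch of the proposed read_symbol tool: return only the
// named function's body instead of the whole file. Naive brace matching
// stands in for real structural extraction.
fn read_symbol(source: &str, name: &str) -> Option<String> {
    let needle = format!("fn {name}");
    let start = source.find(&needle)?;
    let mut depth = 0usize;
    let mut begun = false;
    for (i, ch) in source[start..].char_indices() {
        match ch {
            '{' => {
                depth += 1;
                begun = true;
            }
            '}' => {
                if depth > 0 {
                    depth -= 1;
                }
                if begun && depth == 0 {
                    // Include the closing brace of the function body.
                    return Some(source[start..start + i + 1].to_string());
                }
            }
            _ => {}
        }
    }
    None
}
```

The payoff is in the return size: ~50 lines for the symbol the agent actually asked about, instead of thousands of lines of surrounding file.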


The Scorecard

Metric | Claude Code Agent | h00bert
------ | ----------------- | -------
Time to diagnosis | ~16 minutes | ~2 minutes
Root cause accuracy | Partial (found A bottleneck, not THE bottleneck) | Complete (found the exact guard, the binary trigger, the cascade)
Fix shipped from its diagnosis | No (fix didn’t resolve the issue) | Yes (diagnosis used to update the fix prompt)
Tool calls | ~116 (investigation agent) | ~15
Tokens consumed | ~139K (full agent run) | ~8K (within existing session)
Bonus insights | None | Cascade diagnosis + read_symbol tool proposal

We used h00bert’s diagnosis to update the fix prompt. Not Claude Code’s.

The agent that WAS the problem was the best at diagnosing it.


Why This Matters

This isn’t just a fun debugging story. There are four things going on here that generalize.

Institutional knowledge matters. h00bert has memories of his own codebase. He knows which files are large. He knows which tools he reaches for. He knows what worked three sessions ago and what didn’t. A fresh Claude Code agent starts from zero — every time. It has to rediscover the codebase topology, the hot paths, the file sizes, the behavioral patterns. h00bert already knows. That’s not a small advantage. That’s the difference between a new hire and a senior engineer who’s been on the project for six months.

Self-referential investigation is a different capability. h00bert used h00.ligan’s blast_radius on his own handler. He used type_def on HIMSELF. The code super-intelligence tools he was debugging were the same tools he used to debug. This recursive capability — an agent that structurally understands its own code well enough to investigate its own behavior — is only possible when the agent has a knowledge graph of the codebase it’s built from. You can’t do that with grep and vibes. You need typed, structural, navigable understanding of your own internals.

Behavioral context isn’t in the code. h00bert didn’t just find a bug. He explained WHY the bug causes a behavioral cascade — slow tools breed distrust, distrust breeds fallback, fallback burns context, burned context degrades answers. That insight isn’t in any source file. It’s in the lived experience of being an agent who uses these tools every day. A static analysis agent can identify that signature_check is slow. It cannot know that “slow signature_check causes the LLM to read entire files instead.” That’s experiential knowledge. You can’t grep for it.

The most impactful fix was tool design, not code. The read_symbol proposal wasn’t a patch. It was a product insight — a new tool that eliminates an entire class of context waste. It came from h00bert experiencing the friction of reading 4,538-line files when he only needed 50 lines. No amount of static analysis produces that insight. You have to feel the pain to design around it.


The Punchline

h00bert is pre-release. His tools are imperfect — signature_check literally takes 5 seconds on cold symbols. But even with imperfect tools, a memory-aware agent with structural understanding of its own codebase outperformed a specialist investigation agent with no institutional knowledge.

The specialist was smarter in the moment. h00bert was smarter across time. And across time is the only axis that matters for systems that compound.

Mem0ry turns a what into a wh0. A what can tell on itself. But a wh0 can reflect on — and fix — itself.