How We Built It · Part 8 of 10

War Stories: The Day the Swarm Hung

This is not a highlight reel. This is the session where firing several parallel jobs at once tripped a shared rate limit, most of the work came back empty, and one large job simply stopped responding. Here is exactly what happened, how we recovered, and the rule that has prevented a repeat ever since.

The Setup: When "More" Seemed Like the Obvious Answer

By the time we had a working swarm, the logic felt clear: parallel is faster. You have a big batch of content to generate, you have multiple agents standing by, so you fire them all at once. We had done smaller fan-outs successfully. A few jobs running in parallel, results coming back cleanly, pages getting queued for review. The pattern seemed solid.

So when we had a genuinely large batch to knock out, we scaled up. Several big jobs, each responsible for a meaningful chunk of pages, all launched in close succession. The thinking was efficiency: get them all running, go handle something else, come back to a full harvest. That thinking was wrong.

What Actually Happened

The first sign that something was off was silence. Jobs that should have been reporting progress were not. When we checked on them, most had returned results — but the results were largely empty. Not errors exactly, just work products with nothing in them. A content job that should have returned dozens of pages had produced a handful, or sometimes none.

The second sign was worse: one large job had stopped progressing entirely. It was not erroring out in any clean way. It was just hanging — the process alive, no output, no forward motion. It occupied a slot without doing anything useful.

The root cause, once we understood it, was straightforward. The shared rate budget that governs how many API calls can be made in a given window is global — it does not expand because you have more jobs. When several large jobs competed for that same budget simultaneously, most of them hit the ceiling and stalled waiting for capacity that was already consumed by their siblings. The hung job was the extreme case: it had reached a point in its task that required a call, could not make one, and sat there waiting indefinitely.

Hard-won lesson: A shared rate budget does not multiply with the number of jobs. Firing several large jobs at once does not run them in parallel — it runs them in contention. Most of them slow to a crawl or stop. You get less done, not more.
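
The arithmetic behind that lesson can be sketched in a few lines. This is a toy model, not a measurement from the incident: the function name `run_window`, the round-robin granting, and the numbers are all assumptions chosen for illustration.

```python
def run_window(window_budget: int, demands: list[int]) -> list[int]:
    """Grant calls one at a time, round-robin, until the shared budget is spent.

    window_budget: calls allowed in one window (hypothetical figure).
    demands: how many calls each concurrent job wants in that window.
    Returns the number of calls each job actually got.
    """
    granted = [0] * len(demands)
    remaining = window_budget
    while remaining > 0:
        progressed = False
        for i, demand in enumerate(demands):
            if remaining == 0:
                break
            if granted[i] < demand:
                granted[i] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # every job is satisfied; budget left over
    return granted
```

With a budget of 60 calls, one job asking for 100 gets 60; three jobs asking for 100 each get 20 apiece. The total is still 60. Even in this best case of perfectly fair sharing, adding jobs adds no throughput, and real contention is rarely that fair.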

The Recovery: The Journal Was Everything

The first instinct when something hangs is to kill it and start over. We almost did. If we had, we would have lost all the work the job had completed before it stalled. The thing that stopped us was realizing the job had been writing to its own journal as it went.

Each job in the swarm was structured to write completed work to disk incrementally — not to hold everything in memory and flush at the end. This was not a deliberate disaster-recovery decision at the time; it was just how the jobs were built. But it became the thing that made recovery possible.

We killed the hung job, then opened its output directory. The pages it had finished before stalling were all there, properly formed, ready for the quality gate. We harvested them. We assessed how far the job had gotten, identified the remaining scope, and ran a single controlled replacement job to cover the gap. The total work lost was minimal — a small fraction of what the original job would have produced, and that fraction came back cleanly on the second pass.

Key takeaway: Incremental writes to disk are cheap insurance. A job that writes its results as it goes is recoverable. A job that holds everything in memory until the end is a single point of failure. Build your jobs to write early and often.
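
A minimal sketch of the write-as-you-go pattern, assuming a JSON Lines journal. The record layout and the function names `append_result` and `read_results` are illustrative, not the format the swarm actually used.

```python
import json
import os
from pathlib import Path


def append_result(journal_path: Path, page_id: str, content: str) -> None:
    """Append one finished page to the journal and force it to disk."""
    record = {"page_id": page_id, "content": content}
    with journal_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # survive a kill right after this call


def read_results(journal_path: Path) -> list[dict]:
    """Read back every complete record, skipping a torn final line."""
    results: list[dict] = []
    if not journal_path.exists():
        return results
    with journal_path.open(encoding="utf-8") as fh:
        for line in fh:
            try:
                results.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # the job died mid-write; ignore the fragment
    return results
```

One record per line means a job killed mid-write damages at most the last line, and everything before it can still be harvested.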

The other jobs that had returned mostly-empty results were a similar story. Some had managed to complete a handful of pages before hitting the rate ceiling. We harvested those too, then re-ran the remainder in a single controlled pass once the rate budget had recovered. Nothing was truly lost — it was just scattered across the job journals rather than sitting neatly in a final output file.

The Rule That Came Out of It

After the recovery, we wrote the rule that has governed swarm operations ever since. It is short enough to memorize:

One controlled job at a time. Fan-out happens within a single job — many workers operating in parallel under one coordinator, with a shared awareness of the rate budget. Multiple large jobs competing independently for the same budget is not parallelism; it is contention.

In practice, the rule looks like this. Instead of launching three jobs of one hundred pages each simultaneously, you launch one job that manages three hundred pages internally: batching requests, tracking the budget, and spacing calls when needed. The job itself can use workers in parallel, but those workers report to a single coordinator that knows the global capacity. The coordinator becomes the traffic cop that the rate limiter was already expecting you to have.

# Swarm structure that respects the rate budget

JOB (single coordinator)
  - Knows the total page target
  - Tracks calls made per window
  - Batches workers in small groups (e.g., 5 at a time)
  - Waits when nearing the window ceiling
  - Writes each completed page to disk immediately
  - Reports progress to the handoff file as it goes

WORKERS (parallel within the job, not competing across jobs)
  - Each takes a scoped subtask from the coordinator
  - Returns result + writes to disk
  - Signals done to coordinator
  - Does not spawn further workers
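
The outline above could be sketched in Python roughly as follows. Everything here is an assumption made for illustration: the class `WindowBudget`, the function `run_job`, the one-call-per-page cost, the sliding-window accounting, and the `.txt` file layout. The real limiter may count calls differently.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


class WindowBudget:
    """Tracks calls made in a sliding window (illustrative limits)."""

    def __init__(self, max_calls: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._stamps: deque = deque()  # timestamps of recorded calls

    def reserve(self, n: int) -> None:
        """Block until n more calls fit in the window, then record them."""
        if n > self.max_calls:
            raise ValueError("batch larger than the whole window budget")
        while True:
            now = self._clock()
            # Drop calls that have aged out of the window.
            while self._stamps and now - self._stamps[0] >= self.window_seconds:
                self._stamps.popleft()
            if len(self._stamps) + n <= self.max_calls:
                self._stamps.extend([now] * n)
                return
            # Wait until the oldest recorded call leaves the window.
            self._sleep(self.window_seconds - (now - self._stamps[0]))


def run_job(page_ids: list[str], generate: Callable[[str], str],
            out_dir: Path, budget: WindowBudget,
            batch_size: int = 5) -> list[str]:
    """Single coordinator: batch the workers, respect the budget, write as we go."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done: list[str] = []
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for start in range(0, len(page_ids), batch_size):
            batch = page_ids[start:start + batch_size]
            budget.reserve(len(batch))  # assumes one API call per page
            # Workers in a batch run in parallel; results arrive in order.
            for page_id, text in zip(batch, pool.map(generate, batch)):
                # Write each page as soon as its result is collected.
                (out_dir / f"{page_id}.txt").write_text(text, encoding="utf-8")
                done.append(page_id)
    return done
```

The `clock` and `sleep` parameters exist only so the waiting logic can be exercised without real delays. In this sketch the coordinator does the disk write on behalf of its workers, which keeps a single writer per output directory.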

The other half of the rule addresses recovery. Every job must write its state — what has been attempted, what has succeeded, what has not yet been reached — to a journal that survives the job's own process. If the job hangs or is killed, the journal is the source of truth for what can be harvested and what needs to be re-run.
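
One way such a state journal might look, as a hedged sketch: each state change is appended as one JSON line, and a recovery pass replays the file to decide what to harvest and what to re-run. The status names and the functions `log_event` and `plan_recovery` are hypothetical.

```python
import json
from pathlib import Path


def log_event(journal: Path, page_id: str, status: str) -> None:
    """Append one state change; status is 'attempted' or 'succeeded' here."""
    with journal.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"page_id": page_id, "status": status}) + "\n")


def plan_recovery(journal: Path, target: list[str]) -> dict[str, list[str]]:
    """Split the target into pages to harvest, retry, or run for the first time."""
    last_status: dict[str, str] = {}
    if journal.exists():
        for line in journal.read_text(encoding="utf-8").splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # torn line from an interrupted write
            last_status[event["page_id"]] = event["status"]
    return {
        "harvest": [p for p in target if last_status.get(p) == "succeeded"],
        "retry": [p for p in target if last_status.get(p) == "attempted"],
        "not_reached": [p for p in target if p not in last_status],
    }
```

The `retry` and `not_reached` lists together define the scope of the single controlled replacement job.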

The Fabrication Problem Appeared Here Too

While we were auditing the recovered pages from the hang incident, the quality gate flagged something unrelated to the rate limit. Roughly one in five pages in that batch contained a factual error — a wrong date, an invented detail, a claim that sounded right but was not. These were not hallucinations in the dramatic sense; they were small confident inaccuracies of the kind that slip past a casual read.

Hard-won lesson: The rate limit failure made the fabrication problem visible. Without the disruption forcing us to look closely at the recovered pages, those errors would have shipped. The gate is not optional when the stakes feel low — it is most necessary exactly when the volume is high and manual review feels impractical.

The gate at that point was a second, skeptical review pass run on every page before it entered the deployment queue. It was not sophisticated: a prompt that told the reviewer to find factual claims it could not verify, flag invented specifics, and return a list of concerns. Pages with concerns went back for correction. Pages that cleared went forward. It was slower than shipping everything directly, but the alternative, publishing a batch in which roughly one page in five carried a fabricated claim, was not viable.
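
The routing logic of a gate like this is small enough to sketch. The reviewer below is a stand-in for the skeptical review prompt: its interface (page text in, list of concerns out) is an assumption, as are the names `Reviewer` and `run_gate`.

```python
from typing import Callable

# Hypothetical interface: takes page text, returns a list of concerns.
# An empty list means the page cleared the review.
Reviewer = Callable[[str], list[str]]


def run_gate(pages: dict[str, str],
             reviewer: Reviewer) -> tuple[list[str], dict[str, list[str]]]:
    """Return (cleared page ids, {page id: concerns}) for one batch."""
    cleared: list[str] = []
    flagged: dict[str, list[str]] = {}
    for page_id, text in pages.items():
        concerns = reviewer(text)
        if concerns:
            flagged[page_id] = concerns  # goes back for correction
        else:
            cleared.append(page_id)  # goes forward to the deployment queue
    return cleared, flagged
```

Keeping the reviewer behind a plain callable means the same routing works whether the review is a model prompt, a rule-based check, or a human.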

What "Silent Failure" Actually Costs

The hung job was a loud kind of failure: nothing came out, and you could see something was wrong. Rate saturation on the other jobs was a quieter failure: something came out, but much less than expected, and without a count check you might not notice the shortfall. The fabrications were the quietest failure of all: everything came out, the volume looked right, but the content was wrong in places that a quick skim would not catch.

The lesson across all three is the same. Systems rarely announce their failures. More often they fail by producing output that looks normal. The only reliable defense is a gate that checks the output against what it was supposed to be: not just that something was returned, but that what was returned is complete and correct.

Hard-won lesson: Monitor for the failure signatures, not just the success ones. A monitor that alerts when the error rate spikes tells you nothing about a job that quietly produces half the expected output with no errors. Count checks, content audits, and spot verification are not paranoia — they are the mechanism by which quiet failures become visible.
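
A count check can be very small. The sketch below assumes each finished page is a non-empty `.txt` file in the job's output directory. The layout and the name `count_check` are illustrative.

```python
from pathlib import Path


def count_check(out_dir: Path, expected: int,
                min_ratio: float = 1.0) -> tuple[bool, int]:
    """Compare non-empty page files on disk with the number the job was asked for.

    Returns (passed, produced). Empty files are not counted, so a job that
    returned "results with nothing in them" fails the check.
    """
    produced = sum(1 for p in out_dir.glob("*.txt") if p.stat().st_size > 0)
    return produced >= expected * min_ratio, produced
```

A job that exits cleanly with half its pages passes an error-rate monitor but fails this check, which is the point.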

The Three Rules the Incident Produced

If you take nothing else from this chapter, take these three rules. They apply whether you are running two agents or twenty.

1. One controlled job at a time. Fan-out happens within a single coordinator, not across competing independent jobs. The rate budget is global; treat it that way.
2. Write to disk incrementally. A job that writes results as it goes is recoverable from any point. A job that writes only at the end is a single point of failure. The journal is your recovery path, so maintain it from the first page, not the last.
3. Gate the output before it ships. High volume is the wrong time to skip the quality gate and the right time to enforce it most strictly. The gate exists precisely because review at scale is where errors hide.

Swarm Journal + Recovery Checklist

The exact checklist we use before launching a batch job — covering coordinator setup, incremental write structure, rate budget tracking, and the post-run harvest procedure. One page, plain text, adapt it to whatever tooling you use.


What Comes Next

The architecture, the memory system, the gates, the coordination rules, and now the real failure stories — that is the full system as it operates today. Part 9 turns to the economics: what this costs in practice, what actually compounds in value over time, and how to think about where to invest your effort as the system matures.
