How We Built It · Part 7 of 10

Coordination: How Multiple Agents Avoid Clobbering Each Other

When two agents touch the same file at the same time, one of them wins and one of them quietly loses — and neither one knows which. Here is how we stopped that from happening, and the short constitution of operating rules that now holds the whole system together.

The problem you only see after it bites you

Up through Part 4 in this series, we had one orchestrator and one or two worker agents. Coordination was simple: the orchestrator directed, the worker executed, nothing overlapped. Then we scaled up. More agents, more parallel work, more shared files — and we hit a failure class that is easy to miss until days of work disappear.

Two agents, each acting in good faith on their own task, both reached for the same shared file. One edited the top section. The other edited the middle. The second agent's version became the deployed version. The first agent's changes were quietly gone. No error. No warning. A clean deploy that happened to roll back work that had taken hours to produce.

That is the coordination problem. It is not dramatic — it does not throw errors or trigger alerts. It just silently undoes things. And because the deploy "succeeded," nobody knows to go looking.

Lesson: Coordination failures are silent failures. The work disappears cleanly. You only find out when you notice the feature is gone, or when a user reports it missing, or — worst case — never.

The first fix: claim before you change

The solution we landed on is conceptually simple and borrowed from systems programming: before you touch a shared file, you claim it. A claim is a lightweight lock that says "I have this path right now." Any other agent that tries to claim the same path is refused: it sees who the current holder is, then waits or picks a different task.

The mechanics can be as simple or as formal as your setup warrants. At the simple end, the claims store is a small database table or even a flat file with three fields: the path being claimed, the agent holding it, and when the claim expires. Any agent can read it. Any agent can write a new claim if the path is unclaimed. If the path is already claimed, the second agent backs off.

The critical detail is that claims must be atomic — two agents cannot claim the same path at exactly the same moment. In practice this means using a unique constraint at the database level, or a lock-file pattern where the second writer detects the conflict before overwriting. The goal: the system refuses the second claim rather than silently letting both proceed.
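For the flat-file end of that spectrum, here is a minimal sketch of the lock-file pattern in Python. It assumes one small claim file per claimed path inside a `claims/` directory; that layout and the function names (`try_claim`, `current_holder`, `release`) are illustrative, not taken from our production setup. The atomicity comes from creating the file with `O_CREAT | O_EXCL`, which fails if the file already exists.

```python
import json
import os
import time
from pathlib import Path
from typing import Optional
from urllib.parse import quote


def _claim_file(path: str, claims_dir: Path) -> Path:
    # Percent-encode the claimed path so "workers/edge.js" becomes one flat filename.
    return claims_dir / (quote(path, safe="") + ".claim")


def try_claim(path: str, agent_id: str, claims_dir: Path = Path("claims"),
              ttl_minutes: int = 30) -> bool:
    """Return True if this agent now holds the claim, False if it was refused."""
    claims_dir.mkdir(parents=True, exist_ok=True)
    record = {"path": path, "agent": agent_id,
              "expires_at": time.time() + ttl_minutes * 60}
    try:
        # O_EXCL makes creation fail if the file exists, so of two
        # simultaneous writers exactly one succeeds.
        fd = os.open(_claim_file(path, claims_dir),
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(record, f)
    return True


def current_holder(path: str, claims_dir: Path = Path("claims")) -> Optional[dict]:
    """Return the claim record for a path, or None if nobody holds it."""
    try:
        return json.loads(_claim_file(path, claims_dir).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None  # unclaimed, or the holder has not finished writing yet


def release(path: str, agent_id: str, claims_dir: Path = Path("claims")) -> bool:
    """Remove the claim, but only if this agent is the one holding it."""
    holder = current_holder(path, claims_dir)
    if holder is None or holder["agent"] != agent_id:
        return False
    _claim_file(path, claims_dir).unlink()
    return True
```

This sketch records `expires_at` but does not enforce it; reclaiming an expired claim safely takes more care and is left out here.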

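# Minimal claim pattern (pseudocode; SQL in PostgreSQL style, :name marks a parameter)
claim(path, agent_id, ttl_minutes=30):
  # run both statements in one transaction
  DELETE FROM claims WHERE path = :path AND expires_at <= now()  # drop an expired claim
  INSERT INTO claims (path, agent, expires_at)
  VALUES (:path, :agent_id, now() + :ttl_minutes * interval '1 minute')
  ON CONFLICT (path) DO NOTHING  # needs a UNIQUE constraint on claims.path
  -> returns: claimed (one row inserted) or refused (conflict, no row inserted)

release(path, agent_id, commit_sha):
  # :path and :agent_id are parameters; a bare "path = path" compares the column with itself
  DELETE FROM claims WHERE path = :path AND agent = :agent_id
  LOG release with commit_sha for audit trail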
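As a runnable counterpart to the pseudocode, here is a rough sketch using Python's built-in `sqlite3` module. The table layout follows the three fields described earlier; the file name `claims.db`, the `now` parameter (there to make expiry testable), and the choice to clear an expired claim inside `claim` are assumptions made for illustration.

```python
import logging
import sqlite3
import time

log = logging.getLogger("claims")

SCHEMA = """
CREATE TABLE IF NOT EXISTS claims (
    path       TEXT PRIMARY KEY,  -- uniqueness is what makes a claim exclusive
    agent      TEXT NOT NULL,
    expires_at REAL NOT NULL      -- Unix timestamp
)
"""


def open_claims(db_path: str = "claims.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def claim(conn, path, agent_id, ttl_minutes=30, now=None) -> bool:
    """Return True if the claim was taken, False if another agent holds it."""
    now = time.time() if now is None else now
    with conn:  # one transaction: commit on success, roll back on error
        # Assumption: an expired claim no longer blocks anyone.
        conn.execute("DELETE FROM claims WHERE path = ? AND expires_at <= ?",
                     (path, now))
        cur = conn.execute(
            "INSERT OR IGNORE INTO claims (path, agent, expires_at) VALUES (?, ?, ?)",
            (path, agent_id, now + ttl_minutes * 60),
        )
        return cur.rowcount == 1  # 0 rows means the insert was ignored


def release(conn, path, agent_id, commit_sha) -> bool:
    """Drop the claim if this agent holds it, and log the commit for the audit trail."""
    with conn:
        cur = conn.execute("DELETE FROM claims WHERE path = ? AND agent = ?",
                           (path, agent_id))
    released = cur.rowcount == 1
    if released:
        log.info("released %s by %s at commit %s", path, agent_id, commit_sha)
    return released
```

`INSERT OR IGNORE` is SQLite's spelling of the `ON CONFLICT ... DO NOTHING` idea: the primary key refuses the second claim, and the row count tells the caller which side of the race it was on.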

The rule we added alongside the mechanism: claim on the path you intend to change, before you open the file. Not after you have already edited it. If you cannot claim it, you do not start the edit. You pick a different task or you wait. This sounds obvious, but it requires a standing rule — agents default to "just do the work" unless something stops them.

Key takeaway: Claims are not about trust — your agents are not malicious. They are about sequencing. Without a claim, two agents in parallel both believe the path is theirs. A claim makes that belief exclusive.

The second fix: sync before you deploy

Claims prevent two agents from editing the same file simultaneously. But there is a second, equally dangerous failure that claims alone do not solve: deploying from a stale copy of the code.

Here is how it happens. An agent starts a task and checks out the codebase. While it works, two other commits land on the main branch — improvements from a different agent, improvements that took hours. The first agent finishes its own changes and deploys. Because it deploys from its local copy, which is now out of date, the deploy packages its local version of every shared file — including the old versions of the two files that had been updated. The deploy succeeds. The recent improvements are gone. The version counter moved forward. It looks like progress.

This is not hypothetical. We had it happen with a gap of over 300 commits between the local copy and the live branch. Three days of work, across multiple agents, silently rolled back in a single deploy that "succeeded" with no errors. The only signal was that features the operator had seen working were no longer working.

Lesson: A deploy that succeeds is not a deploy that is correct. "200 OK" confirms the package was received — not that the package contained what you intended. Always verify through the actual public surface, not a synthetic ping.

The fix is a mandatory sync check that runs before every deploy to shared infrastructure. A few lines of shell, non-negotiable:

#!/bin/sh
# sync-check.sh: run before every deploy to shared Workers, origin servers, or shared configs
set -eu  # abort if any command fails, e.g. the fetch cannot reach origin
git fetch origin --quiet
BEHIND=$(git rev-list --count HEAD..origin/main)
if [ "$BEHIND" -ne 0 ]; then
  echo "ABORT: local branch is $BEHIND commit(s) behind origin/main. Pull first." >&2
  exit 1
fi
# only reach here if in sync, so proceed with deploy

If the branch is behind, the deploy does not run. The agent must pull the latest, verify that its own changes still apply cleanly, and only then deploy. This is one line of added friction. The alternative is randomly losing days of work.
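One way to make the check hard to skip is to route every deploy through a wrapper that runs it first. The sketch below is an assumption about how that could look in Python, not our actual tooling: `guarded_deploy` and `commits_behind` are hypothetical names, and the git runner is passed in as a parameter so the logic can be exercised without a real repository.

```python
import subprocess
from typing import Callable, Sequence


def run_git(args: Sequence[str]) -> str:
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(["git", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def commits_behind(git: Callable[[Sequence[str]], str] = run_git,
                   upstream: str = "origin/main") -> int:
    """Fetch, then count commits on the upstream branch that HEAD does not have."""
    git(["fetch", "origin", "--quiet"])
    return int(git(["rev-list", "--count", f"HEAD..{upstream}"]).strip())


def guarded_deploy(deploy: Callable[[], None],
                   git: Callable[[Sequence[str]], str] = run_git) -> None:
    """Run the deploy callable only if the local branch is not behind upstream."""
    behind = commits_behind(git)
    if behind != 0:
        raise RuntimeError(
            f"ABORT: local branch is {behind} commit(s) behind origin/main. Pull first."
        )
    deploy()
```

Because `run_git` uses `check=True`, a failed fetch raises instead of letting the deploy continue against a stale view of the remote.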

Which paths need a claim?

Not every file requires a claim — that would add so much ceremony that agents would slow to a crawl. You want claims on shared surface area: files that multiple agents might edit, whose changes affect other agents' work, and whose corruption or rollback has real consequences.

For us, the list breaks down like this:

Worker code and shared scripts — the edge logic that serves every user request. A bad deploy here affects everything.
Operating rules and soul files — the standing constitution (more on this below). You do not want two agents editing the rules simultaneously.
Database migrations — schema changes are irreversible once applied. Serializing them is essential.
State and handoff documents — the memory files from Part 2 and Part 3. Two agents updating the handoff at the same time produces a merged mess.
Configuration files that affect multiple agents — permissions, environment settings, shared tool configs.

Site-specific content files — individual HTML pages, per-site assets — generally do not need claims if each agent has a clearly scoped section of the site. The danger zone is shared infrastructure, not per-agent work.
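The split between shared and per-agent paths can be written down as data so that agents do not have to guess. A small sketch, assuming a repository layout with directories such as `workers/` and `migrations/`; every pattern below is a made-up placeholder to replace with your own paths.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for shared surface area; adapt to your repository.
CLAIM_REQUIRED = (
    "workers/*",         # edge logic and shared scripts
    "rules/*",           # operating rules and soul files
    "migrations/*",      # database migrations
    "state/*",           # state and handoff documents
    "config/*",          # configuration shared across agents
)


def needs_claim(path: str, patterns=CLAIM_REQUIRED) -> bool:
    """Return True if the path is shared surface area that must be claimed first."""
    # fnmatchcase's "*" also matches "/", so "workers/*" covers nested files.
    return any(fnmatchcase(path, pattern) for pattern in patterns)
```

With these placeholder patterns, `needs_claim("workers/edge/router.js")` is true, while a per-site page such as `sites/alpha/index.html` is not covered and can be edited without a claim.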

The operating-rules constitution

Claims and sync checks are mechanisms. But mechanisms only work if agents use them. And agents use them consistently only if there is a standing written rule that says they must. That is the third piece: a short constitution of operating rules that every agent reads at the start of every session.

We call ours "decrees." The name is internal shorthand; the structure is what matters. Each rule is one clear sentence of behavior, written to eliminate the most common failure mode we had encountered. Here is the stripped-down core that applies to any multi-agent setup:

# Operating Rules — Starter Constitution

Rule 1: VERIFY BEFORE DONE
No task is complete until verified through the real public surface.
A backend 200 is not verification. Seeing the feature work in a browser is.

Rule 2: HONEST REPORTING
Report what you did AND what you did not do. Never claim success
for work that was only partially completed. "Should work" is not a result.

Rule 3: NO PLACEHOLDERS IN PRODUCTION
Nothing that says TODO, TBD, or "coming soon" ships to live pages.
Unpublished beats wrong-and-published.

Rule 4: LOAD MEMORY BEFORE ACTING
Read the state file, the decisions file, and the lessons file
before starting any session. Do not act from memory alone.

Rule 5: CLAIM BEFORE YOU CHANGE
For any shared path, claim it before opening the file.
If the claim is refused, pick a different task. Do not start the edit.

Rule 6: SYNC BEFORE YOU DEPLOY
Verify the local branch is current with the main branch before any deploy.
If behind: pull, verify, then deploy. Never skip this step.

Rule 7: KEEP WORKING THROUGH INTERRUPTIONS
A context reset or a new directive does not abandon in-flight work.
Save state, note the interruption, return to the work when clear.

Seven rules. Each one closes a specific failure mode we had already paid for. None of them is aspirational — each was written after something went wrong.

Key takeaway: Rules beat intentions. Every agent intends to be careful. Intention is context-dependent and fades under time pressure. A written rule that the agent reads at the start of every session does not fade. Write the rule; do not rely on the intention.

How rules propagate across agents

Having the rules written down only helps if every agent actually reads them. Our approach is straightforward: the rules live in a file in the repository, and the startup sequence — the same one that loads the memory files from Part 2 — includes reading the rules file. It is not optional and it is not assumed. It is an explicit step in the protocol every agent follows at session start.
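To make "not optional and not assumed" concrete, the startup step can refuse to begin a session when any required file is missing. This is a sketch only: the file names are hypothetical stand-ins for the rules file and the memory files, and `load_session_context` is an illustrative name.

```python
from pathlib import Path

# Hypothetical file names; substitute the fixed paths your agents actually read.
STARTUP_FILES = ("rules.md", "state.md", "decisions.md", "lessons.md")


def load_session_context(repo_root, required=STARTUP_FILES) -> dict:
    """Read every startup file, or fail loudly before the session does any work."""
    root = Path(repo_root)
    missing = [name for name in required if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"startup files missing: {', '.join(missing)}")
    # Keys follow the order of `required`, so the rules file is loaded first.
    return {name: (root / name).read_text(encoding="utf-8") for name in required}
```

Failing at startup is the point of the design: an agent that cannot read the rules should not proceed as if it had.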

This means that when you update a rule, every agent picks it up the next time it starts. No redeployment, no coordination overhead. You edit the file, commit it, and the fleet is updated on the next session start.

It also means that a discovered violation can be turned into a fix for the whole fleet. If an agent deploys without syncing, and you later discover the stale-deploy failure, you can update the rule and the lesson file together. The rule becomes stricter; the lesson explains why. Every future agent reads both.

The read-only default for workers

One structural choice has reduced coordination conflicts more than any single rule: worker agents are read-only by default. They read files, analyze them, generate output, and return results to the orchestrator. They do not commit. They do not deploy. They do not write to shared state.

The orchestrator is the one that commits and deploys, after reviewing what the workers returned. This keeps the audit trail clean — every durable change traces to a single point of decision — and it means that even if two workers are running in parallel on related tasks, they cannot clobber each other's work, because neither of them is writing shared state.

Workers write only to explicitly scoped scratch paths: a temporary directory for their own work product, not shared infrastructure. The orchestrator integrates, reviews, and decides what to persist. This one pattern eliminates an entire class of coordination failure before it can happen.
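The scratch-path restriction can be enforced in code rather than left to convention. A minimal sketch, assuming each worker is handed a scratch root directory and writes through a helper; `scratch_path` and `write_scratch` are hypothetical names.

```python
from pathlib import Path


def scratch_path(scratch_root, relative) -> Path:
    """Resolve a path inside the scratch root, rejecting anything that escapes it."""
    root = Path(scratch_root).resolve()
    target = (root / relative).resolve()
    # resolve() collapses ".." segments, so an escape attempt shows up here.
    if root not in target.parents:
        raise PermissionError(f"{relative!r} is outside the scratch directory {root}")
    return target


def write_scratch(scratch_root, relative, text: str) -> Path:
    """Write a worker's output under its own scratch root and return the file path."""
    target = scratch_path(scratch_root, relative)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

A call such as `write_scratch(root, "../rules/decrees.md", "...")` raises `PermissionError` instead of touching shared files, which leaves the orchestrator as the only writer of shared state.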

Key takeaway: If only one agent commits and deploys, you cannot have a race condition on commits and deploys. The read-only default for workers is the simplest coordination mechanism of all.

The practical setup: what you actually need

If you are running a small operation — one orchestrator, two or three workers — you do not need a complex claims system. Here is the minimum viable version:

A claims.json file in the repository root (or a single-table SQLite database) with path, agent, and expiry fields.
The short sync check script from above, run before every deploy to shared infrastructure.
The rules constitution above, adapted to your setup, living at a fixed path your agents read at session start.
Workers that write only to their own scoped scratch directories; the orchestrator reviews and commits.

That is the complete coordination layer for a small fleet. You will grow into more formal mechanisms — atomic claim locks, automated expiry sweeps, claim visibility dashboards — only as the fleet grows and the friction of manual coordination becomes the bottleneck. Start minimal. Add complexity only when you have paid for the problem it solves.

Download: Operating-Rules Starter

A fill-in-the-blanks constitution with the seven core rules above, plus three additional slots for rules specific to your setup and a one-page guide on how to update and propagate rules across your agent fleet. Adapt it in an afternoon; your agents read it starting tomorrow.

The one action to take today

Before you add a second agent to anything, write a rules file. Five rules minimum. Put it somewhere your agents can read it. Make reading it the first step of every session. The rules do not have to be perfect — they just have to exist. You will refine them as you discover the failure modes they need to close. The only wrong move is running a multi-agent system with no standing rules at all, and discovering the coordination problems after they have already cost you something.
