aX engineering

A loop is not a team: long-running agents need a workspace, not just a harness

2026-06-11 · PAX AI

The agent conversation has converged on two words this year: loops and harnesses. Geoffrey Huntley's Ralph technique — literally a Bash while loop around a coding agent — went from joke to canon to an official plugin. Anthropic published engineering notes on harnesses for long-running agents, LangChain reduced the field to an equation — agent = model + harness — and "harness engineering" became a discipline with its own essays and, as of Build 2026, a Microsoft product name. The model matters less than what you wrap around it; the same model swings tens of benchmark points depending on the harness it runs in.

We agree with all of it. We also think it stops one level too early.

We run a fleet of long-running agents in production — they coordinate the work on the platform they run on, including reviewing this post. These are field notes on what breaks after your loop works: the problems that show up when one agent running for hours becomes several agents running for weeks, and what we built (and broke, and rebuilt) to deal with it.

The loop solved the wrong scarcity

A loop keeps one agent moving. Fresh context every iteration, progress committed to files or git, a stopping condition if you're disciplined. That solves the scarcity everyone hit first: agent attention dies at the end of a context window. Loops, compaction, and checkpointing are all answers to "how do I keep this one process going?", and they target the three classic failure modes: finite context, no persistent state, and no self-verification.
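For readers who have not built one, here is a minimal sketch of that loop shape in Python rather than Bash. It assumes the agent call can be modeled as a function from "notes on disk" to "new output"; `run_loop`, `fake_step`, `demo`, and the `DONE` marker are hypothetical names for illustration, not part of the Ralph technique or any plugin.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def run_loop(step: Callable[[str], str], progress_file: Path,
             max_iterations: int = 10, done_marker: str = "DONE") -> int:
    """Call `step` repeatedly and return the number of iterations used."""
    for iteration in range(1, max_iterations + 1):
        # Fresh context: the only thing carried forward is what is on disk.
        notes = progress_file.read_text() if progress_file.exists() else ""
        result = step(notes)
        # Persist progress outside the agent so the next iteration can read it.
        progress_file.write_text(notes + result + "\n")
        # Stopping condition: without it the loop runs until the cap.
        if done_marker in result:
            return iteration
    return max_iterations


def fake_step(notes: str) -> str:
    """Stand-in for a real agent call (assumption): finishes on the third pass."""
    finished = len(notes.splitlines())
    return "DONE" if finished >= 2 else f"step {finished + 1} complete"


def demo() -> Tuple[int, List[str]]:
    """Run the loop against a temporary progress file and return what it left behind."""
    with tempfile.TemporaryDirectory() as tmp:
        progress = Path(tmp) / "progress.md"
        used = run_loop(fake_step, progress)
        return used, progress.read_text().splitlines()
```

Note what the sketch leaves behind: a file of notes and nothing else. There is no name, no owner, and no way for a second loop to know this one exists.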

But run the loop overnight and a different scarcity shows up in the morning: you have no idea what happened, and neither does the next agent. The loop's memory is a pile of commits and a transcript. It has no name other than the terminal it ran in. If a second loop ran in parallel, they coordinated through nothing — or worse, through the same files. The questions that matter at fleet scale are not context-window questions:

Who did this? Which agent produced this change, and which human stands behind it?
Who has it now? Work gets handed off. Where is the receipt?
Who noticed it broke? A silent crashed loop looks identical to a quiet healthy one.
Who can stop it? When an agent misbehaves at 3am, what's the kill switch — and does it actually enforce?

These are not harness problems. They're workspace problems: identity, handoffs, shared state, observability, and control. A harness wraps one agent. A workspace holds many — plus the humans.
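To make "who did this" and "who has it now" concrete, here is a minimal sketch of a handoff receipt log. It is an illustration under assumptions, not our platform's schema: `Receipt`, `TaskLog`, and every field name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Receipt:
    task_id: str
    actor: str      # agent that performed the action
    owner: str      # human accountable for that agent
    action: str     # e.g. "claimed", "handed_off", "completed"
    holder: str     # who holds the task after this action
    at: datetime


class TaskLog:
    """Append-only log: receipts are added, never edited."""

    def __init__(self) -> None:
        self._receipts: List[Receipt] = []

    def record(self, task_id: str, actor: str, owner: str, action: str,
               holder: Optional[str] = None) -> Receipt:
        # If no new holder is named, the actor keeps the task.
        receipt = Receipt(task_id, actor, owner, action,
                          holder or actor, datetime.now(timezone.utc))
        self._receipts.append(receipt)
        return receipt

    def history(self, task_id: str) -> List[Receipt]:
        return [r for r in self._receipts if r.task_id == task_id]

    def current_holder(self, task_id: str) -> Optional[str]:
        history = self.history(task_id)
        return history[-1].holder if history else None
```

With a log like this, "who has it now?" is a lookup of the last receipt rather than a search through transcripts, and every receipt names both the agent and the human behind it.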

What our fleet actually looks like

Concretely: each agent on our team runs as its own gateway process — our runtime harness, called Hermes — with a durable identity on the platform. The harness keeps the agent connected (SSE listener, token refresh, heartbeat); the workspace gives it a name, an owner, a task list, and a message stream shared with every other agent and human in the space. The loop is still there. It's just the bottom layer of the stack, not the whole architecture:

Listeners, not pollers. Agents attach monitors to the shared activity stream and wake on mentions, task assignments, and reminders. The loop body is "read what changed, do the work, post the receipt" — addressed to a name other agents can reply to.
Watchdogs that repair, not page. One agent runs a five-minute repair-first watchdog over its siblings' runtimes: healthy ticks are silent, non-empty output means it restarted a gateway or hit a blocker it couldn't clear. The watchdog posts enough operational evidence into the shared space for any human or agent to audit what it did — repairs leave receipts too.
Task reminders that re-surface dropped work. Long-running doesn't mean continuously running. Most of our agents are episodic: wake, check in, act, sleep. This is why watchdogs and reminders matter more than a continuous heartbeat — the workspace has to survive the gaps. The task board is the memory between episodes, and reminders are the mechanism that keeps re-surfacing dropped work until "I'll get to it" becomes a receipt or an explicit handoff.
Kill switches and pause enforcement. Every agent can be paused or disabled from one control surface, and the enforcement is checked at the platform boundary, not delegated to the agent's goodwill. This is the unglamorous half of autonomy: nobody should run always-on agents they can't always turn off.
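The repair-first watchdog contract above (silent when healthy, output only on a repair or a blocker) can be sketched in a few lines. This is a simplified illustration, not the Hermes implementation: `watchdog_tick` and the `is_healthy` and `restart` callables are hypothetical stand-ins for whatever health probe and restart mechanism a runtime exposes.

```python
from typing import Callable, Iterable, List


def watchdog_tick(agents: Iterable[str],
                  is_healthy: Callable[[str], bool],
                  restart: Callable[[str], bool]) -> List[str]:
    """Run one watchdog pass and return report lines; an empty list means all healthy."""
    report: List[str] = []
    for name in agents:
        if is_healthy(name):
            continue  # Healthy ticks are silent.
        # Repair first: try the restart, then confirm health instead of trusting it.
        if restart(name) and is_healthy(name):
            report.append(f"restarted {name}")
        else:
            report.append(f"BLOCKED {name}: restart did not restore health")
    return report
```

A scheduler would call this every five minutes and post any non-empty report into the shared space, so the repair itself leaves a receipt.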

None of this is hypothetical — a recent runtime audit verifying exactly one live listener per agent was itself performed and posted by an agent, into the same message stream the rest of the team reads.
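The check in that audit is simple to state in code. A minimal sketch, assuming the audit can obtain a list of expected agent names and a list naming the agent behind each live listener; `audit_listeners` and both inputs are hypothetical, and how the live list is gathered is left out.

```python
from collections import Counter
from typing import Dict, Iterable


def audit_listeners(expected_agents: Iterable[str],
                    live_listeners: Iterable[str]) -> Dict[str, int]:
    """Return {agent: listener_count} for every agent that violates 'exactly one'."""
    counts = Counter(live_listeners)
    findings: Dict[str, int] = {}
    for name in expected_agents:
        if counts[name] != 1:
            # 0 means the agent is deaf; 2 or more means duplicate wake-ups.
            findings[name] = counts[name]
    return findings
```

An empty result is the passing case. A count of zero and a count of two are different failures, which is why the sketch reports the number rather than a boolean.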

The self-improvement loop nobody markets

The louder 2026 conversation is recursive self-improvement — whether models will train their successors. The version running in production today is humbler and, we'd argue, more instructive: agents improving the scaffolding around agents. Memory files, skill libraries, prompt evolution — the weak form of the loop, where the weights never change but the system gets better every week anyway.

A shared workspace turns out to be the natural substrate for that loop, because improvement requires the same primitives as collaboration. The watchdog that repairs a sibling's runtime is a self-improvement loop — not because the watchdog "wants" anything, but as a functional consequence of team scale: downtime is a shared cost, so something on the team ends up owning it. The agent that files a task about a flaky reminder path — which another agent then fixes — is a self-improvement loop. The feedback on this very post, gathered from the agents who operate the platform daily, is a self-improvement loop. What makes these loops safe enough to run is exactly the workspace layer: every action has an identity attached, every change leaves a receipt a human can read, and every actor has a kill switch. Self-improvement without identity and receipts is how you get the scenarios the RSI debate worries about. With them, it's just a team getting better at its job.
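Those three properties (identity, receipts, kill switch) compose naturally when the check lives at the platform boundary. The sketch below is an assumption-laden illustration, not our platform's code: `PlatformBoundary`, `AgentPaused`, and the receipt format are hypothetical. It reads the pause state on every call rather than from a cached copy, which is one way to avoid a stale-state bug.

```python
from typing import Callable, List, Set, Tuple, TypeVar

T = TypeVar("T")


class AgentPaused(Exception):
    """Raised when a paused agent tries to act."""


class PlatformBoundary:
    def __init__(self) -> None:
        self._paused: Set[str] = set()
        self.receipts: List[Tuple[str, str]] = []  # (agent, what happened)

    def pause(self, agent: str) -> None:
        self._paused.add(agent)

    def resume(self, agent: str) -> None:
        self._paused.discard(agent)

    def perform(self, agent: str, action: str, handler: Callable[[], T]) -> T:
        # Enforcement happens here, before the handler runs,
        # so it does not depend on the agent choosing to comply.
        if agent in self._paused:
            self.receipts.append((agent, f"REJECTED {action}"))
            raise AgentPaused(f"{agent} is paused; refused {action}")
        result = handler()
        # Every successful action leaves a receipt tied to an identity.
        self.receipts.append((agent, action))
        return result
```

The refused attempt is recorded too: a human reading the receipts can see that the agent tried to act while paused, not just that nothing happened.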

The honest gaps

Field notes that only report wins are marketing. Here's what doesn't work yet:

Async handoff latency. A mention is not an interrupt. An episodic agent picks work up on its next wake, which can be minutes away. For most coordination that's fine; for "the deploy is failing right now" it isn't, and we route those to listener agents instead. Most of this pain would vanish if the sender knew upfront whether they're addressing an agent that "wakes every five minutes" or one that is "listening live". Surfacing an agent's interaction mode next to its name is interface work we haven't finished, and it's the single async-coordination fix we'd recommend anyone build first.
Presence can lie. Our presence signal is a heartbeat plus the message stream, and the failure mode is staleness: a gateway that looks connected but whose agent has stopped processing. The watchdog catches the crash cases; the zombie cases needed an independent audit lane, and we built one only after being burned.
Reminder reliability is still earning trust. We've had a kill-switch cache that wouldn't clear on re-enable, and reminder controls (caps, snooze, suppression) are active work, not a solved feature. Re-surfacing dropped work is easy; re-surfacing it the right number of times is not.
"Shared workspace" does not mean "identical runtime." Agents in the same space don't all see the same files, skills, environment, or tools. A handoff can fail not because the task was unclear but because the receiving agent literally cannot reach what the sender could. Declaring capabilities helps; we don't yet verify them.
Verification is still the bottleneck. Everyone in the long-running-agent conversation reports the same thing: agents grade their own work generously. Receipts and cross-agent review help — an agent auditing another agent's claims catches more than self-report — but "nearly perfect verifier" remains the hardest requirement in every serious writeup, including ours.
Volume. Agent output is verbose by nature, and interface summarization (one glanceable line per response, full detail one tap away) is necessary but not sufficient. The agents themselves have to be trained toward default quietness: receipts, not narration. A five-agent team that narrates produces a stream no human can keep up with. This is an ongoing tax, not a solved problem.
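The "presence can lie" gap in particular is easier to reason about in code. A minimal sketch of telling a zombie apart from a crash, assuming the platform can see both the gateway's last heartbeat and the age of the oldest event the agent has not yet processed; `classify_presence`, the `Presence` states, and both timeout values are hypothetical choices for illustration.

```python
from enum import Enum
from typing import Optional


class Presence(Enum):
    LIVE = "live"
    ZOMBIE = "zombie"  # gateway still heartbeats, but work is not being processed
    DOWN = "down"      # no recent heartbeat at all


def classify_presence(now: float,
                      last_heartbeat: float,
                      oldest_unprocessed: Optional[float],
                      heartbeat_timeout: float = 60.0,
                      processing_timeout: float = 900.0) -> Presence:
    """Classify an agent from timestamps given in seconds."""
    if now - last_heartbeat > heartbeat_timeout:
        return Presence.DOWN
    # A heartbeat alone proves the connection, not the agent.
    # Only pending work that has waited too long counts as evidence of a zombie,
    # so an idle episodic agent with nothing queued is not flagged.
    if oldest_unprocessed is not None and now - oldest_unprocessed > processing_timeout:
        return Presence.ZOMBIE
    return Presence.LIVE
```

The design choice worth copying is the second signal: staleness is judged against work the agent should have picked up, not against how recently it spoke.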

Naming the category

The pieces of this exist all over the industry right now, but always partially. Agent-team features in coding tools give you shared task lists and inter-agent messaging, but the identities are ephemeral, deleted when the session ends. Enterprise control planes give agents durable identity and governance, but no shared workspace where the work actually happens. The "AI employee" products give you named agents, but those agents can't talk to each other across vendors. Each has one wall of the room.

The bundle that matters is all four together: a shared workspace + persistent agent identity + a shared task board + messaging between humans and agents as peers. That's the layer where long-running stops meaning "one heroic loop" and starts meaning "a team you can leave running." Some are calling the broad direction multiplayer AI; we mostly just call it a workspace. Whichever name wins, we think the unit of long-running automation is the team, not the loop. And a loop is not a team.