Perspective

AI agents should be governed like a delivery organisation, not prompted like a chatbot

Serious agentic systems do not fail only because the model is weak. They fail because the work is not structured: handoffs are loose, assumptions drift, orchestration lives inside context, and nobody can inspect why the system made a decision.

Most agentic AI demos look impressive right up until the workflow becomes long, ambiguous, or business-critical. Then they fall apart, and people usually blame the model.

Sometimes the model is the problem. But in many long-running workflows, I think that is the wrong diagnosis. To me, the bigger problem is that there is no delivery structure around the reasoning.

I have spent a long time building a multi-agent system that takes software from requirement to designed, implemented, and tested code, with human approval gates at the points that matter.

What it taught me is the same thing twenty years of enterprise architecture and delivery taught me before AI: the hard part was never producing output. The hard part was holding quality across handoffs.

That is true of a team of people. It is even more true of a team of agents.

So this is not about how many agents I have. Anyone can define a BA agent, an architect agent, a developer agent, or a tester agent. That part is easy.

This is about the boring part that actually decides whether the system produces something trustworthy or produces confident but wrong output: how the agents pass work to each other, and what sits between them.

The reframe

Most people design agentic systems like a chatbot with tools.

One clever model, some functions, a loop, and a hope that it stays on track.

That works for short tasks and collapses on long ones, because natural language is lossy. Every step quietly fills the gaps with its own assumptions. Over twenty hops, the drift compounds.

My view is that a serious agentic system should be designed like a governed delivery organisation instead.

Every agent needs what a person in a real delivery organisation needs: a defined role, a boundary on its authority, a contract for what it receives, a contract for what it produces, a reviewer where the stakes justify one, an escalation path, and a written record of the decisions it made.
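As a rough illustration, that list can be written down as a per-agent specification. This is a minimal sketch, not how any particular system defines its agents: `AgentSpec`, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical description of one agent's place in the organisation."""
    role: str                  # the defined job
    may_decide: frozenset      # boundary on its authority
    input_artifact: str        # contract for what it receives
    output_artifact: str       # contract for what it produces
    reviewer: Optional[str]    # set only where the stakes justify one
    escalates_to: str          # escalation path
    decision_log: str          # where its decisions are recorded

    def within_authority(self, decision: str) -> bool:
        """True if the agent may make this kind of decision itself."""
        return decision in self.may_decide


# Illustrative values only; none of these names come from a real system.
sa_writer = AgentSpec(
    role="SA Writer",
    may_decide=frozenset({"component_layout", "interface_shape"}),
    input_artifact="requirements_doc",
    output_artifact="solution_design",
    reviewer="SA Reviewer",
    escalates_to="human_design_gate",
    decision_log="decisions/sa_writer.jsonl",
)
```

Anything outside `may_decide` would go up the escalation path rather than being settled by the agent on its own.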

[Figure: The reframe. A chatbot with tools (short tasks) compared with a governed delivery organisation (long, critical work) built on role, contracts, review and escalation, and record. A clever model in a loop holds for short tasks. Long, business-critical workflows need the structure of a delivery organisation.]

Get those right and the raw intelligence of any single agent matters less than people think.

The sharpest of those controls, and the one most setups miss, is the contract on what passes between agents.

The handoff is an artifact, not a conversation

When one agent finishes and the next begins, what moves between them should not be a loose message.

It should be a structured artifact with a defined shape. That artifact becomes the contract the next agent builds against.

A relay of conversations drifts.

An assembly line of artifacts holds, because each station produces a defined part and the part becomes the interface. The next station does not need to have been in the room. It reads the part.
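One way to picture "a defined shape" is a typed artifact that can report why it is not yet ready to hand off. The sketch below is an assumption for illustration only: `RequirementsDoc`, its fields, and the readiness rules are hypothetical, not a schema taken from a real system.

```python
from dataclasses import dataclass, field


@dataclass
class RequirementsDoc:
    """Hypothetical handoff artifact; the field names are illustrative only."""
    goal: str
    user_stories: list = field(default_factory=list)
    acceptance_criteria: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)

    def contract_violations(self) -> list:
        """Return the reasons this artifact is not ready to hand off."""
        problems = []
        if not self.goal.strip():
            problems.append("goal is empty")
        if not self.user_stories:
            problems.append("no user stories")
        if not self.acceptance_criteria:
            problems.append("no acceptance criteria")
        if self.open_questions:
            problems.append("open questions must be resolved before handoff")
        return problems


draft = RequirementsDoc(
    goal="Let customers export invoices",
    user_stories=["As a customer, I can download an invoice as a PDF"],
)
print(draft.contract_violations())  # ['no acceptance criteria']
```

The next station never has to ask what the previous one meant. It either receives an artifact with no violations or it receives nothing.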

[Figure: The handoff. A relay of conversations, where each hop fills the gaps with its own assumptions, compared with an assembly line of artifacts (requirements doc, solution design, interface contracts, test report), where each part is the interface and the next station just reads it.]

That single design choice is what lets a workflow run for hours without a human re-reading every step.

The important thing in a long workflow is therefore not only the sequence of stages. It is the controls between them.

Spawning agents is not the same as orchestrating them

Someone might point out that frontier models already do multi-agent work.

They can now spawn sub-agents on their own, break a complex task into pieces, fire off helpers in parallel, and pull the results back together.

That is real progress. It is genuinely useful.

But look closely at two things.

First, those sub-agents are usually generic. They are spun up on the fly for whatever the moment seems to need, with no fixed role, no defined contract for what they must produce, and no standing reviewer holding them to a standard.

Even when we can configure specialised ones, that still does not automatically make them a contracted, reviewed team.

It is closer to an ad hoc task force than a governed delivery organisation. That is fine for parallel research and exploration. It is not the same as a team we can hold accountable to an output.

Second, and this is the one that bit me, the orchestrator is usually still a model.

Even when the real work is delegated to sub-agents, the orchestrating model still has to hold the whole picture in its own context: the state, the workflow, what each sub-agent handed back, what is left to do, and what has already been decided.

Over a long run, that context grows and decays. The orchestrator starts to lose the plot.

I learned this firsthand in an earlier version of my own system, where an LLM was the orchestrator. Most of the work was already pushed out to sub-agents, and it still drifted on long builds.

The issue was not the sub-agents. It was the thing holding everything together.

So in the upgraded version of the same system, I changed one core design decision. I stopped using an LLM as the orchestrator and moved orchestration into a deterministic engine.

[Figure: What holds the run together. A model as orchestrator, holding the state, the workflow, every sub-agent's hand-back, and every decision in its context window, compared with a deterministic engine that reads workflow status and artifact completion and decides what step comes next, nothing more.]

The engine holds the workflow state and decides what step comes next from workflow status and artifact completion.

It does not decide what is true. It has no judgement and no memory of its own to corrupt.

The structured artifacts hold every output. No model carries the whole run in its head, so there is nothing to drift. Each agent does one bounded job against a clear contract and hands back an artifact. To decide what happens next, the engine reads the workflow state, not a conversation it has been holding for hours.
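A minimal sketch of what such an engine could look like, assuming a fixed, linear workflow. The step names, artifact names, and the `"running"` status value are hypothetical; the point is only that the next step is a pure function of workflow status and artifact completion.

```python
from typing import Optional

# Hypothetical workflow: each step names the artifact it must produce.
WORKFLOW = [
    ("ba_writer", "requirements_doc"),
    ("sa_writer", "solution_design"),
    ("developer", "dev_log"),
    ("qa", "test_report"),
]


def next_step(status: str, completed_artifacts: set) -> Optional[str]:
    """Return the step to run next, or None if nothing should run.

    The decision depends only on the workflow status and on which
    artifacts exist. No model is consulted and no history is kept.
    """
    if status != "running":  # e.g. waiting at a human gate, or failed
        return None
    for step, artifact in WORKFLOW:
        if artifact not in completed_artifacts:
            return step
    return None  # every artifact is present, so the run is complete


print(next_step("running", {"requirements_doc"}))  # sa_writer
```

Given the same status and the same set of artifacts, the function returns the same answer however long the run has been going.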

The controls that matter

I organised the agents into three teams, the way a delivery organisation would be structured.

A Product team turns a raw requirement into a reviewed, approved Product Requirements Document. A Build team turns that into designed, planned, implemented, and tested software. An Incident team handles what breaks in user testing and production later: investigation, root-cause analysis, fix planning, regression checks, and knowledge capture.

[Figure: Three teams and their artifacts.
Product: requirements doc, review notes.
Build: solution design, implementation guide, development log, test report.
Incident: root-cause analysis, fix plan, knowledge capture.
Between every team sits a named artifact, not a chat message, that the next team builds against.]

But the team names are not the point. The controls are.

Structured artifact contracts

Each handoff is a named artifact, not a chat message: a requirements document, a solution design, interface contracts, an implementation guide, a development log, a test report, a build completion report, a root-cause analysis, or a fix plan.

The next agent reads the artifact and builds against it.

Nothing advances on vibes.

Writer-reviewer pairs, but only where they earn their cost

The most important agents come in pairs.

A BA Writer decomposes the requirement. A BA Reviewer independently reads the same requirement, extracts its own view, and checks the first agent's work against it.

An SA Writer produces the design. An SA Reviewer independently re-verifies the claims, including checking against the real codebase whether code claimed to be reused actually exists.

But I do not put a reviewer behind every agent.

[Figure: Reviewers where they earn their cost. Upstream, the requirement (BA Writer, BA Reviewer) and the design (SA Writer, SA Reviewer) are reviewed because a mistake compounds. Downstream, implementation (Developer, caught by tests) and testing (QA, self-checking) are not, because a mistake stays local.]

The pairs sit upstream, where a mistake is cheap to make and ruinous to inherit. A misread requirement or a wrong architectural call poisons everything built on top of it.

Downstream, where an error is local and caught by tests anyway, a second reviewing agent would just be tax.

Controls are not free. The skill is spending them where a mistake compounds.
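The placement rule can be sketched as data plus two small functions. This is an assumption-laden illustration: the `Stage` type, the `mistake_compounds` flag, and the set-difference check are hypothetical simplifications of what a reviewer does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    writer: str
    mistake_compounds: bool  # would an error here be inherited downstream?


# Hypothetical stage list mirroring the roles described above.
STAGES = [
    Stage("requirement", "BA Writer", mistake_compounds=True),
    Stage("design", "SA Writer", mistake_compounds=True),
    Stage("implementation", "Developer", mistake_compounds=False),
    Stage("testing", "QA", mistake_compounds=False),
]


def agents_for(stage: Stage) -> list:
    """Attach a reviewer only where a mistake would compound."""
    agents = [stage.writer]
    if stage.mistake_compounds:
        agents.append(stage.writer.replace("Writer", "Reviewer"))
    return agents


def review_gaps(writer_items: set, reviewer_items: set) -> set:
    """Items the reviewer extracted independently that the writer missed."""
    return reviewer_items - writer_items


print(agents_for(STAGES[0]))  # ['BA Writer', 'BA Reviewer']
print(agents_for(STAGES[2]))  # ['Developer']
```

The reviewer's value comes from extracting its own view first and comparing second, which is what `review_gaps` stands in for here.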

Controlled autonomy, gated by impact

I do not believe serious agentic workflows should be fully autonomous by default. But I believe even less in gating everything.

Human gates should sit only where business meaning, architecture direction, or delivery risk changes: requirements, design, sprint plan, accepting a root-cause analysis, and approving a fix. Not at every step.

[Figure: Human gates, by impact. Gates sit at requirements, design, sprint plan, accepting the RCA, and approving the fix. Every other step runs on its own.]

Put a human gate everywhere and we have not built controlled autonomy. We have built a slow manual process with extra ceremony, and we have thrown away the productivity that made agents worth using.

The point is to spend human attention where mistakes can carry downstream and compound, not at every step just because approval feels safer.
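In code, gating by impact can be as small as a set of gated steps and one check. A minimal sketch, assuming hypothetical step names; the gated set follows the list of gates given above.

```python
# Hypothetical step names for the five gates named in the text.
HUMAN_GATES = {
    "requirements",
    "design",
    "sprint_plan",
    "accept_root_cause_analysis",
    "approve_fix",
}


def may_proceed(step: str, approvals: set) -> bool:
    """A gated step proceeds only once a human has approved it.

    Ungated steps always proceed, so autonomy is the default and
    approval is the exception.
    """
    return step not in HUMAN_GATES or step in approvals


print(may_proceed("implementation", set()))  # True
print(may_proceed("design", set()))          # False
print(may_proceed("design", {"design"}))     # True
```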

Everything is auditable

Every consequential choice an agent makes is written into a decision log, with the alternatives it weighed.

That is not only for the machine later. It is for a person now.

When the system makes a call, a human can open the log and read why it did what it did, instead of staring at a diff and guessing.

For anything business-critical, that is not a nicety. It is the difference between a system we can sign off on and one we can only hope is right.
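A decision log can be as simple as an append-only file with one record per choice. The sketch below assumes a JSON Lines file; the `Decision` fields and function names are hypothetical, chosen to show the alternatives being recorded next to the choice.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class Decision:
    """One consequential choice; the field names are illustrative."""
    agent: str
    question: str
    chosen: str
    alternatives: list  # options that were weighed and rejected
    rationale: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_decision(log_path: str, decision: Decision) -> None:
    """Append the decision as one JSON line; earlier entries are untouched."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(decision)) + "\n")


def read_decisions(log_path: str) -> list:
    """Load every recorded decision, oldest first."""
    with open(log_path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]
```

A person reviewing a change can then read the question, the option taken, and the options rejected, instead of inferring intent from the diff.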

Hierarchical decomposition

The build does not run as one flat chain. It decomposes again. Each story runs its own small pipeline of a Tech Lead, a Developer, and a QA, with its own artifacts and checks, before a Sprint QA agent tests the whole sprint together.

Big goal into units. Each unit through its own loop. Then the whole sprint validated as a system.

That is how real delivery scales, and it is how agent delivery scales too.
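The control flow can be sketched with each role modelled as a plain function. This is a simplification under stated assumptions: real agents would call a model, and the dictionary keys (`story`, `passed`, `failed_stories`) are hypothetical.

```python
from typing import Callable

# Each role is modelled as a function from artifact to artifact.
# These stand-ins only show the control flow, not real agent behaviour.
Role = Callable[[dict], dict]


def run_story(story: str, tech_lead: Role, developer: Role, qa: Role) -> dict:
    """Run one story through its own small pipeline."""
    plan = tech_lead({"story": story})
    build = developer(plan)
    return qa(build)


def run_sprint(stories: list, tech_lead: Role, developer: Role, qa: Role,
               sprint_qa: Callable[[list], dict]) -> dict:
    """Every story passes its own loop before the sprint is tested as a whole."""
    results = [run_story(s, tech_lead, developer, qa) for s in stories]
    failed = [r["story"] for r in results if not r.get("passed")]
    if failed:
        return {"passed": False, "failed_stories": failed}
    return sprint_qa(results)
```

Sprint-level testing only runs once every story has cleared its own checks, so a failure is reported at the level where it occurred.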

Memory has to compound

A system should not only finish the current task. It should make the next one safer.

Every decision log, every root-cause analysis, every fix, and every resolution should be written to a knowledge base that persists and stays queryable.

When an agent re-enters that codebase months later, it should be able to read why before it touches anything.

[Figure: Memory compounds. Decision logs, root-cause analyses, fixes and resolutions, and deliberate constraints feed a knowledge base that persists and stays queryable. On re-entry months later, an agent reads why before touching anything.]

Which approaches were tried and abandoned. Which decisions were deliberate. Which constraints still matter. Which parts of the codebase are fragile. Which bug has already been fixed twice. Which implementation choices were made for business reasons rather than technical purity.
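A persistent, queryable store does not need to be elaborate. The sketch below uses Python's built-in `sqlite3` module; the table layout, the `kind` values, and the function names are assumptions made for illustration, not a description of any real knowledge base.

```python
import sqlite3


def open_knowledge_base(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; pass a file path to make it persist."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS knowledge ("
        " kind TEXT NOT NULL,"        # e.g. decision, rca, fix, constraint
        " component TEXT NOT NULL,"   # part of the codebase it concerns
        " summary TEXT NOT NULL)"
    )
    return db


def record(db: sqlite3.Connection, kind: str, component: str, summary: str) -> None:
    """Store one piece of knowledge; the transaction commits on success."""
    with db:
        db.execute(
            "INSERT INTO knowledge VALUES (?, ?, ?)", (kind, component, summary)
        )


def why(db: sqlite3.Connection, component: str) -> list:
    """Everything recorded about a component, oldest first."""
    rows = db.execute(
        "SELECT kind, summary FROM knowledge WHERE component = ? ORDER BY rowid",
        (component,),
    )
    return rows.fetchall()
```

An agent re-entering a component would call something like `why(db, "billing")` before proposing a change, and would see the earlier fixes and deliberate constraints first.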

The reason teams are nervous about letting AI back into old code is not only that the model might write something bad.

It is that the model has no memory of any of this. Without that context, how can it know how to build correctly and safely?

Give it that memory as structured artifacts, and re-entry becomes less dangerous.

To me, that growing knowledge base is the real asset.

Generating code is a moment. Remembering why is what makes it safe to live with.

The takeaway

I do not think the future of agents is one super-agent doing everything.

I am not convinced it is a model spawning generic helpers on the fly and holding the whole run in its context window either.

For serious work, I think it is closer to a governed network of specialised agents: each with a clear role, a contract on its inputs and outputs, a reviewer where the risk warrants one, a human gate where the decision matters, artifacts that can be inspected later, a deterministic engine holding the workflow state, and a memory that compounds over time.

Defining that roster is the easy part. The judgement that goes inside each role is what actually takes the years. But the structure is what makes the judgement usable at all.

So for anyone building with agents, the question I would ask is not: “How many agents do we have?”

It is: “How safe, traceable, and reusable are the handoffs between them?”

And just as important: “What is holding the whole run together?”

If the answer to the last part is “a model, in its context window,” then drift is not an edge case. It is a design risk.

The work is to take that weight off the model and put it into structure.

The agents are becoming a commodity.

The governance around them is the system.

See it in practice

This is how Fostery is built.

Structured artifact handoffs, writer-reviewer pairs, a deterministic engine, and a knowledge base that compounds. The governed process Fostery runs to build real software, now in closed beta.