Most agentic AI demos look impressive right up until the workflow becomes long, ambiguous, or business-critical. Then they fall apart, and people usually blame the model.
Sometimes the model is the problem. But in many long-running workflows, I think that is the wrong diagnosis. To me, the bigger problem is that there is no delivery structure around the reasoning.
I have spent a long time building a multi-agent system that takes software from requirement to designed, implemented, and tested code, with human approval gates at the points that matter.
What it taught me is the same thing twenty years of enterprise architecture and delivery taught me before AI: the hard part was never producing output. The hard part was holding quality across handoffs.
That is true of a team of people. It is even more true of a team of agents.
So this is not about how many agents I have. Anyone can define a BA agent, an architect agent, a developer agent, or a tester agent. That part is easy.
This is about the boring part that actually decides whether the system produces something trustworthy or produces confident but wrong output: how the agents pass work to each other, and what sits between them.
The reframe
Most people design agentic systems like a chatbot with tools.
One clever model, some functions, a loop, and a hope that it stays on track.
That works for short tasks and collapses on long ones, because natural language is lossy. Every step quietly fills the gaps with its own assumptions. Over twenty hops, the drift compounds.
My view is that a serious agentic system should be designed like a governed delivery organisation instead.
Every agent needs what a person in a real delivery organisation needs: a defined role, a boundary on its authority, a contract for what it receives, a contract for what it produces, a reviewer where the stakes justify one, an escalation path, and a written record of the decisions it made.
Get those right and the raw intelligence of any single agent matters less than people think.
The sharpest of those controls, and the one most setups miss, is the contract on what passes between agents.
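To make that list of needs concrete, here is a minimal sketch of how one agent's role, authority, contracts, reviewer, escalation path, and decision record could be written down as data. The class name `AgentSpec`, its fields, and the example values are all hypothetical and chosen for illustration; they are not taken from the system described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical description of one agent's place in the delivery organisation."""

    role: str                       # defined role, e.g. "BA Writer"
    allowed_actions: frozenset      # boundary on its authority
    input_artifact: str             # contract for what it receives
    output_artifact: str            # contract for what it produces
    escalates_to: str               # escalation path when it cannot proceed
    decision_log: str               # where its written record of decisions goes
    reviewer: Optional[str] = None  # set only where the stakes justify one

    def may(self, action: str) -> bool:
        """Return True if the action falls inside this agent's authority."""
        return action in self.allowed_actions


# Illustrative values only.
ba_writer = AgentSpec(
    role="BA Writer",
    allowed_actions=frozenset({"read_requirement", "write_prd_draft"}),
    input_artifact="raw_requirement",
    output_artifact="prd_draft",
    escalates_to="product_owner",
    decision_log="decisions/ba_writer.jsonl",
    reviewer="BA Reviewer",
)
```

Keeping the specification frozen and declarative means the boundary on authority is something the surrounding system can check, rather than something the agent is merely asked to respect in a prompt.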
The handoff is an artifact, not a conversation
When one agent finishes and the next begins, what moves between them should not be a loose message.
It should be a structured artifact with a defined shape. That artifact becomes the contract the next agent builds against.
A relay of conversations drifts.
An assembly line of artifacts holds, because each station produces a defined part and the part becomes the interface. The next station does not need to have been in the room. It reads the part.
That single design choice is what lets a workflow run for hours without a human re-reading every step.
The important thing in a long workflow is therefore not only the sequence of stages. It is the controls between them.
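As an illustration of an artifact with a defined shape, the sketch below models a solution design as a typed record and checks it against its contract before it is allowed to move on. The field names and the validation rules are assumptions made for the example, not the actual schema of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SolutionDesign:
    """Hypothetical artifact shape; field names are illustrative."""

    requirement_id: str
    components: tuple = ()
    interfaces: tuple = ()
    open_questions: tuple = ()


def validate_handoff(design: SolutionDesign) -> list:
    """Return contract violations; an empty list means the artifact may move on."""
    problems = []
    if not design.requirement_id:
        problems.append("missing requirement_id")
    if not design.components:
        problems.append("no components listed")
    if not design.interfaces:
        problems.append("no interfaces defined")
    if design.open_questions:
        problems.append("unresolved open questions")
    return problems


complete = SolutionDesign(
    requirement_id="REQ-1",
    components=("billing-service",),
    interfaces=("POST /invoices",),
)
incomplete = SolutionDesign(requirement_id="REQ-2", open_questions=("Which database?",))
```

The next agent never sees the conversation that produced the design. It sees a record that either satisfies the contract or is sent back with a list of specific violations.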
Spawning agents is not the same as orchestrating them
Someone might point out that frontier models already do multi-agent work.
They can now spawn sub-agents on their own, break a complex task into pieces, fire off helpers in parallel, and pull the results back together.
That is real progress. It is genuinely useful.
But look closely at two things.
First, those sub-agents are usually generic. They are spun up on the fly for whatever the moment seems to need, with no fixed role, no defined contract for what they must produce, and no standing reviewer holding them to a standard.
Even when we can configure specialised ones, that still does not automatically make them a contracted, reviewed team.
It is closer to an ad hoc task force than a governed delivery organisation. That is fine for parallel research and exploration. It is not the same as a team we can hold accountable to an output.
Second, and this is the one that bit me, the orchestrator is usually still a model.
Even when the real work is delegated to sub-agents, the orchestrating model still has to hold the whole picture in its own context: the state, the workflow, what each sub-agent handed back, what is left to do, and what has already been decided.
Over a long run, that context grows and decays. The orchestrator starts to lose the plot.
I learned this firsthand in an earlier version of my own system, where an LLM was the orchestrator. Most of the work was already pushed out to sub-agents, and it still drifted on long builds.
The issue was not the sub-agents. It was the thing holding everything together.
So in the upgraded version of the same system, I changed one core design decision. I stopped using an LLM as the orchestrator and moved orchestration into a deterministic engine.
The engine holds the workflow state and decides what step comes next from workflow status and artifact completion.
It does not decide what is true. It has no judgement and no memory of its own to corrupt.
The structured artifacts hold every output. No model carries the whole run in its head, so there is nothing to drift. Each agent does one bounded job against a clear contract and hands back an artifact. The engine then decides what happens next from the recorded workflow state, not from a conversation it has been holding for hours.
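A deterministic engine of this kind can be very small. The sketch below is one possible shape, assuming a linear workflow and hypothetical agent and artifact names: the next step is computed purely from which artifacts already exist, with no model in the loop.

```python
from typing import Optional

# Hypothetical workflow table: (agent, artifact it produces, artifacts it needs).
WORKFLOW = [
    ("ba_writer", "prd_draft", ["raw_requirement"]),
    ("ba_reviewer", "prd_review", ["prd_draft"]),
    ("sa_writer", "solution_design", ["prd_review"]),
    ("developer", "development_log", ["solution_design"]),
]


def next_step(completed: set) -> Optional[str]:
    """Return the agent to run next, or None when every artifact exists."""
    for agent, produces, needs in WORKFLOW:
        if produces in completed:
            continue  # this step's artifact already exists
        missing = [n for n in needs if n not in completed]
        if missing:
            # The engine does not guess; it stops and reports what is absent.
            raise RuntimeError(f"{agent} is blocked, missing: {missing}")
        return agent
    return None
```

Given the same set of completed artifacts, `next_step` always returns the same answer, which is the property an LLM orchestrator cannot offer over a long run.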
The controls that matter
I organised the agents into three teams, the way a delivery organisation would be structured.
A Product team turns a raw requirement into a reviewed, approved Product Requirements Document. A Build team turns that into designed, planned, implemented, and tested software. An Incident team handles what breaks in user testing and production later: investigation, root-cause analysis, fix planning, regression checks, and knowledge capture.
But the team names are not the point. The controls are.
Structured artifact contracts
Each handoff is a named artifact, not a chat message: a requirements document, a solution design, interface contracts, an implementation guide, a development log, a test report, a build completion report, a root-cause analysis, or a fix plan.
The next agent reads the artifact and builds against it.
Nothing advances on vibes.
Writer-reviewer pairs, but only where they earn their cost
The most important agents come in pairs.
A BA Writer decomposes the requirement. A BA Reviewer independently reads the same requirement, extracts its own view, and checks the first agent's work against it.
An SA Writer produces the design. An SA Reviewer independently re-verifies the claims, including checking against the real codebase whether code claimed to be reused actually exists.
But I do not put a reviewer behind every agent.
The pairs sit upstream, where a mistake is cheap to make and ruinous to inherit. A misread requirement or a wrong architectural call poisons everything built on top of it.
Downstream, where an error is local and caught by tests anyway, a second reviewing agent would just be tax.
Controls are not free. The skill is spending them where a mistake compounds.
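The two reviewer checks described above can be sketched as plain set comparisons. This is a simplified illustration under stated assumptions: the function names are hypothetical, requirements are reduced to short strings, and the codebase is represented as a set of known file paths rather than a real repository scan.

```python
def compare_requirements(writer_items: set, reviewer_items: set) -> dict:
    """Diff the writer's decomposition against the reviewer's independent reading."""
    return {
        "missed_by_writer": reviewer_items - writer_items,
        "not_found_by_reviewer": writer_items - reviewer_items,
    }


def unverified_reuse_claims(claimed_paths: list, codebase_paths: set) -> list:
    """Return paths the design claims to reuse that do not exist in the codebase."""
    return [p for p in claimed_paths if p not in codebase_paths]


diff = compare_requirements(
    writer_items={"export to CSV", "filter by date"},
    reviewer_items={"export to CSV", "filter by date", "audit trail"},
)
phantom = unverified_reuse_claims(
    claimed_paths=["src/auth/token.py", "src/billing/tax.py"],
    codebase_paths={"src/auth/token.py"},
)
```

The point of the independent pass is that the reviewer builds its own view first and only then compares, so a shared misreading is less likely to slip through unchallenged.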
Controlled autonomy, gated by impact
I do not believe serious agentic workflows should be fully autonomous by default. But I believe even less in gating everything.
Human gates should sit only where business meaning, architecture direction, or delivery risk changes: requirements, design, sprint plan, accepting a root-cause analysis, and approving a fix. Not at every step.
Put a human gate everywhere and we have not built controlled autonomy. We have built a slow manual process with extra ceremony, and we have thrown away the productivity that made agents worth using.
The point is to spend human attention where mistakes can carry downstream and compound, not at every step just because approval feels safer.
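Gating by impact can be expressed as a short allow-list rather than a blanket rule. The sketch below assumes hypothetical artifact identifiers that mirror the five gate points named above; everything else advances without waiting for a person.

```python
from typing import Optional

# Artifact kinds that require a human sign-off (illustrative identifiers).
HUMAN_GATED = frozenset(
    {"requirements", "design", "sprint_plan", "root_cause_analysis", "fix_plan"}
)


def can_advance(artifact_kind: str, approved_by: Optional[str] = None) -> bool:
    """Gated artifacts need a named approver; all others advance automatically."""
    if artifact_kind in HUMAN_GATED:
        return approved_by is not None
    return True
```

For example, `can_advance("development_log")` passes on its own, while `can_advance("design")` stays blocked until an approver is recorded.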
Everything is auditable
Every consequential choice an agent makes is written into a decision log, with the alternatives it weighed.
That is not only for the machine later. It is for a person now.
When the system makes a call, a human can open the log and read why it did what it did, instead of staring at a diff and guessing.
For anything business-critical, that is not a nicety. It is the difference between a system we can sign off on and one we can only hope is right.
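A decision log does not need to be elaborate to be useful. The following is a minimal sketch, assuming an append-only list of JSON lines and a hypothetical `Decision` record; the field names and the example decision are illustrative.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Decision:
    """One consequential choice, with the alternatives that were weighed."""

    agent: str
    choice: str
    alternatives: tuple
    rationale: str


def append_decision(log: list, decision: Decision) -> None:
    """Append the decision as one JSON line; existing entries are never rewritten."""
    log.append(json.dumps(asdict(decision)))


log = []
append_decision(
    log,
    Decision(
        agent="SA Writer",
        choice="reuse existing auth module",
        alternatives=("write new auth module", "adopt third-party provider"),
        rationale="existing module already covers the required token flow",
    ),
)
entry = json.loads(log[0])
```

Because each entry records the rejected options alongside the chosen one, a person reading it later sees the reasoning, not only the outcome.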
Hierarchical decomposition
The build does not run as one flat chain. It decomposes again. Each story runs its own small pipeline of a Tech Lead, a Developer, and a QA, with its own artifacts and checks, before a Sprint QA agent tests the whole sprint together.
Big goal into units. Each unit through its own loop. Then the whole sprint validated as a system.
That is how real delivery scales, and it is how agent delivery scales too.
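The two-level structure can be sketched as a loop over stories followed by a single sprint-wide check. In this illustration the per-story pipeline (Tech Lead, Developer, QA) and the Sprint QA agent are passed in as plain callables; their names, signatures, and the returned status strings are assumptions for the example.

```python
from typing import Callable


def run_sprint(
    stories: list,
    run_story: Callable[[str], bool],
    sprint_qa: Callable[[list], bool],
) -> str:
    """Run each story through its own pipeline, then validate the sprint as a whole."""
    failed = [s for s in stories if not run_story(s)]
    if failed:
        # Sprint QA is not attempted until every story passes its own checks.
        return "stories_failed: " + ", ".join(failed)
    return "sprint_passed" if sprint_qa(stories) else "sprint_qa_failed"


# Stand-in callables for illustration; real ones would invoke the agents.
all_pass = run_sprint(["S1", "S2"], lambda s: True, lambda ss: True)
one_fails = run_sprint(["S1", "S2"], lambda s: s != "S2", lambda ss: True)
integration_fails = run_sprint(["S1", "S2"], lambda s: True, lambda ss: False)
```

Stories can each pass in isolation and still fail together, which is why the sprint-level check is a separate step rather than a summary of the story results.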
Memory has to compound
A system should not only finish the current task. It should make the next one safer.
Every decision log, every root-cause analysis, every fix, and every resolution should be written to a knowledge base that persists and stays queryable.
When an agent re-enters that codebase months later, it should be able to read why the code looks the way it does before it touches anything.
Which approaches were tried and abandoned. Which decisions were deliberate. Which constraints still matter. Which parts of the codebase are fragile. Which bug has already been fixed twice. Which implementation choices were made for business reasons rather than technical purity.
The reason teams are nervous about letting AI back into old code is not only that the model might write something bad.
It is that the model has no memory of any of this. Without that context, how can it know how to build correctly and safely?
Give it that memory as structured artifacts, and re-entry becomes less dangerous.
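One way to picture a queryable knowledge base is as a list of structured entries indexed by the code paths they touch. The sketch below is deliberately simple and entirely hypothetical: the `KnowledgeEntry` fields, the entry kinds, and the sample data are invented for illustration, and a real store would more likely be a database or search index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeEntry:
    """One persisted record: a decision, a root-cause analysis, or a fix."""

    kind: str         # e.g. "decision", "rca", "fix"
    paths: frozenset  # code paths the entry relates to
    summary: str


def history_for(path: str, kb: list) -> list:
    """Return every entry that mentions the given path, in stored order."""
    return [e for e in kb if path in e.paths]


# Invented sample data.
kb = [
    KnowledgeEntry("decision", frozenset({"src/billing/tax.py"}),
                   "Rounding follows finance policy, not banker's rounding."),
    KnowledgeEntry("fix", frozenset({"src/billing/tax.py", "src/billing/invoice.py"}),
                   "Second fix for off-by-one on invoice line totals."),
    KnowledgeEntry("rca", frozenset({"src/auth/token.py"}),
                   "Token expiry bug traced to timezone handling."),
]
tax_history = history_for("src/billing/tax.py", kb)
```

An agent about to edit a file would run a query like this first and receive the deliberate decisions and prior fixes attached to that file before proposing any change.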
To me, that growing knowledge base is the real asset.
Generating code is a moment. Remembering why is what makes it safe to live with.
The takeaway
I do not think the future of agents is one super-agent doing everything.
I am not convinced it is a model spawning generic helpers on the fly and holding the whole run in its context window either.
For serious work, I think it is closer to a governed network of specialised agents: each with a clear role, a contract on its inputs and outputs, a reviewer where the risk warrants one, a human gate where the decision matters, artifacts that can be inspected later, a deterministic engine holding the workflow state, and a memory that compounds over time.
Defining that roster is the easy part. The judgement that goes inside each role is what actually takes the years. But the structure is what makes the judgement usable at all.
So for anyone building with agents, the question I would ask is not: “How many agents do we have?”
It is: “How safe, traceable, and reusable are the handoffs between them?”
And just as important: “What is holding the whole run together?”
If the answer to the last part is “a model, in its context window,” then drift is not an edge case. It is a design risk.
The work is to take that weight off the model and put it into structure.
The agents are becoming a commodity.
The governance around them is the system.