Multi-agent orchestration patterns: a 2026 production guide
Six multi-agent system patterns that actually ship in 2026 — supervisor, swarm, hierarchical, blackboard, sequential, hybrid — with framework picks and the production failure modes nobody warns you about.
A multi agent system is a software architecture where two or more autonomous language-model agents coordinate to solve a task that one agent cannot, or should not, handle alone. Each agent has its own prompt, its own tool set, often its own model choice, and a defined way of passing work to another agent. The supervisor pattern routes through a single coordinator; the swarm pattern lets peers hand off directly; the hierarchical pattern stacks supervisors. That is the whole idea, and in 2026 it is one of the most over-prescribed shapes in applied AI.
We've shipped enough agent systems to think the conversation around them is upside down. The interesting question isn't whether you can wire three Claude Sonnet 4.6 agents into a CrewAI graph. You can, in an afternoon. The interesting question is whether you should, what the production failure modes look like once the demo is live, and which of the six orchestration patterns actually fits the work you're trying to automate. That's what this post covers, written for the Tech Lead, VP Eng, or CTO who has to pick a pattern and own the on-call rotation when something breaks at 3am.
Multi agent system, in one minute
Working definition. A multi agent system is a runtime in which two or more LLM-driven agents exchange messages, share a workspace, and call tools, in order to complete a task that has been decomposed across them. Each agent is a prompt plus a model plus a tool set plus a memory of the conversation so far. The orchestrator is the code that decides who runs next, what they see, and when the system has reached an answer. That orchestrator is the architectural choice you're making when you pick LangGraph or CrewAI or AutoGen or the OpenAI Agents SDK or the Anthropic Agent SDK, or a custom state machine you wrote in TypeScript.
What's actually shipping in 2026. Most production agent systems we see in the wild settle on a supervisor or a sequential pipeline; a smaller but real cohort runs hierarchical (supervisors of supervisors) for genuinely complex workflows; the swarm and blackboard shapes show up in research-and-summarise tasks where parallel exploration pays. The model layer is split between Claude Opus 4.7 for the planner role, Claude Sonnet 4.6 or GPT-5 mini for the worker roles, and Claude Haiku 4.5 or GPT-5 nano for routing classifiers. Token spend is dominated by the supervisor's growing context window, not by the worker calls, and that one fact drives more of the unit economics than any framework choice.
Throughout this piece we'll use multi agent system as the umbrella for any architecture where more than one LLM-driven agent participates in the same task. Where the pattern actually changes the engineering tradeoffs (cost and latency and failure modes), we'll say which pattern and why.
Multi agent system architecture: the five components every pattern shares
Every multi agent system architecture, regardless of framework, reduces to five components. The agents themselves: each is a prompt template, a model binding, and a tool list. The message bus or graph edge: how agent A's output reaches agent B's input. Shared state: the workspace or scratchpad or graph-state object that survives across handoffs. The tool registry: what external functions are callable, and by whom. And the orchestrator: the code that picks the next agent, enforces handoff rules, and decides termination.
The split between message bus and shared state is the part most teams get wrong on a first build. LangGraph keeps them separate: the graph defines edges (the bus), and a typed state object travels between nodes (the state). CrewAI fuses them inside its Crew object, which is convenient until you need to inspect what one agent saw and another didn't. AutoGen's group-chat abstraction puts everything in a single conversation buffer, which is easy to reason about for the first three agents and an observability nightmare for the next seven. Pick the framework whose split matches how you want to debug at 3am, not the one with the prettiest landing page.
Tool registry design is where security and cost both live. A naive multi agent system architecture hands every agent every tool. A production one scopes tools per role: the planner can call the search tool and the database read tool; only the writer agent can call the email-send tool; nobody but the auditor can call the destructive delete tool. The same pattern that scopes IAM in a real backend service applies here, and skipping it is what turns an agent demo into a security finding. We walk through the same logic in our pillar piece on AI agent development company engagements; the principle there is identical.
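To make per-role scoping concrete, here is a minimal sketch of a registry that enforces it. The class name `ToolRegistry`, the role names, and the two stub tools are all hypothetical; a real build would plug in whatever tool abstraction the chosen framework provides.

```python
from typing import Callable, Dict, Set


# Hypothetical tools; real ones would call a search backend or an email API.
def search(query: str) -> str:
    return f"results for {query}"


def send_email(to: str, body: str) -> str:
    return f"sent to {to}"


class ToolRegistry:
    """Maps tool names to callables and records which roles may call each one."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}
        self._allowed: Dict[str, Set[str]] = {}

    def register(self, name: str, fn: Callable[..., str], roles: Set[str]) -> None:
        self._tools[name] = fn
        self._allowed[name] = set(roles)

    def tools_for(self, role: str) -> list[str]:
        # Only these names should be advertised in the role's tool schema.
        return sorted(n for n, roles in self._allowed.items() if role in roles)

    def call(self, role: str, name: str, **kwargs) -> str:
        # Enforce the scope at call time too, in case a model asks for a tool it was never shown.
        if role not in self._allowed.get(name, set()):
            raise PermissionError(f"role {role!r} may not call tool {name!r}")
        return self._tools[name](**kwargs)


registry = ToolRegistry()
registry.register("search", search, roles={"planner", "writer"})
registry.register("send_email", send_email, roles={"writer"})
```

Checking the scope in two places (what the agent is shown, and what it is allowed to execute) is deliberate: the first keeps prompts small, the second is the actual security boundary.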
Shared state is the second place careful design pays off. The cheapest shared state is a Python dict in process memory and that works fine for a graph that completes in seconds on a single worker. The moment the work spans a queue or a Temporal activity, the state has to live outside the process; Postgres is the boring default and the right one for almost every team we work with. Redis is fine for short-lived caches but a poor primary store for an agent system because you actually want history. Most teams write a thin state-store wrapper inside their orchestrator code so the swap from in-memory to Postgres is a one-line change when the system grows up.
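A thin state-store wrapper of the kind described above might look like the sketch below. It is an illustration under assumptions: the `StateStore` interface is hypothetical, and SQLite stands in for Postgres so the example runs without a server. The table layout is ours, not a framework's.

```python
import json
import sqlite3
from typing import Optional, Protocol


class StateStore(Protocol):
    def save(self, run_id: str, state: dict) -> None: ...
    def load(self, run_id: str) -> Optional[dict]: ...


class InMemoryStore:
    """Fine for a graph that finishes in seconds on a single worker."""

    def __init__(self) -> None:
        self._data: dict[str, list[dict]] = {}

    def save(self, run_id: str, state: dict) -> None:
        # Append a copy instead of overwriting, so history survives.
        self._data.setdefault(run_id, []).append(json.loads(json.dumps(state)))

    def load(self, run_id: str) -> Optional[dict]:
        snapshots = self._data.get(run_id)
        return snapshots[-1] if snapshots else None


class SqlStore:
    """SQLite stands in for Postgres here; the schema is an assumption."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS state_snapshots ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, run_id TEXT NOT NULL, state TEXT NOT NULL)"
        )

    def save(self, run_id: str, state: dict) -> None:
        self._conn.execute(
            "INSERT INTO state_snapshots (run_id, state) VALUES (?, ?)",
            (run_id, json.dumps(state)),
        )
        self._conn.commit()

    def load(self, run_id: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT state FROM state_snapshots WHERE run_id = ? ORDER BY id DESC LIMIT 1",
            (run_id,),
        ).fetchone()
        return json.loads(row[0]) if row else None


# The one-line swap: the orchestrator only ever sees `store`.
store: StateStore = InMemoryStore()
# store = SqlStore(sqlite3.connect("agent_state.db"))
```

Both implementations append snapshots instead of overwriting them, which is the "you actually want history" point in practice.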
The six orchestration patterns you'll actually use
There are dozens of named patterns in the research literature; six of them cover almost every production system we've shipped or reviewed. Supervisor: one coordinator routes work to N specialist subagents and aggregates their replies. Hierarchical: supervisors of supervisors, where a top-level planner delegates entire sub-tasks to mid-level supervisors that own their own teams. Swarm or network: peer agents hand off to each other directly, no central coordinator, with a termination rule. Blackboard: agents read and write a shared structured workspace, picking up work when their precondition is met. Sequential pipeline: a fixed chain of agents, each passing output to the next, with no branching. Hybrid: a real system that combines two of the above, almost always supervisor-plus-pipeline.
| | Supervisor | Swarm / network |
|---|---|---|
| Coordinator | Single central planner | None; peer-to-peer handoff |
| Best for | Heterogeneous specialists | Embarrassingly parallel research |
| Token cost | Linear in subagents + supervisor context growth | Quadratic if peers broadcast |
| Debuggability | High — one trace tree | Low — graph of handoffs |
| Failure mode | Supervisor context window OOM | Infinite handoff loop |
| Default framework | LangGraph supervisor | OpenAI Swarm / Agents SDK |
When to pick each. Reach for supervisor when work decomposes cleanly into specialist roles (researcher and drafter and fact-checker and editor). Reach for hierarchical only when a single supervisor's context window can't hold the full task plan; this is rarer than people think now that Claude Opus 4.7 sits comfortably in the long-context tier. Reach for swarm when the task is genuinely parallel, the subagents don't need each other's intermediate outputs, and you have a hard termination rule (count or time or consensus). Reach for blackboard when the system has long-running asynchronous work and the agents are actually different services owned by different teams. Reach for sequential pipeline when the steps are well-defined and deterministic and you don't actually need agentic reasoning between them; in that case you're better off with a plain workflow runtime and a few LLM calls inside it.
Hybrid deserves a closer look because it's where most production systems actually land. A common shape is supervisor-plus-pipeline: a top-level supervisor that picks a sub-workflow for the incoming request, and inside each sub-workflow a deterministic sequential pipeline does the work without any further agentic decisions. That shape gets you the routing flexibility of an agent plus the predictability of a workflow; the trace is still readable. Another common shape is supervisor-plus-swarm for deep-research-style tasks: the planner runs as a supervisor and the research step inside it fans out to a small swarm of parallel researchers that the planner then summarises. The trick with hybrid is to keep the boundary between the two patterns explicit in the code; a hybrid that quietly drifts into a fourth-pattern hybrid-of-hybrids ends up impossible to reason about.
A note on naming. Different framework communities use different words for similar shapes. LangGraph calls its default the 'supervisor' graph; CrewAI calls it 'hierarchical process'; AutoGen calls it 'group chat with a manager'; the OpenAI cookbook talks about 'orchestrator-worker'. They're all the same pattern with different ergonomics. Don't get attached to the framework's vocabulary if it confuses your team; pick a name internally and document it once and use it everywhere.
Blackboard in the wild: supply-chain agents at a Fortune-500 logistics shop
The cleanest public example of a blackboard multi agent system in production today is the wave of supply-chain optimisation rollouts running on top of Temporal at large logistics shops. Each warehouse, freight lane, and demand-forecast region is wrapped in its own agent, and they all read and write a structured shared workspace that holds the current best-known plan. A planning agent posts a draft schedule; warehouse agents check feasibility against their local constraints and post objections; a reconciler agent picks up when enough objections land and re-runs the plan. The pattern works here for one reason: the agents are owned by different teams with different release cadences, and forcing them through a single supervisor would have created a coordination bottleneck nobody wanted to own. It's classic blackboard (Hearsay-II from the 1970s, just with LLMs reading and writing JSON instead of speech tokens), and the production write-ups from JPMorgan, Walmart Labs, and a handful of European 3PLs all describe broadly the same shape. We've reviewed one engagement that matched this shape and we'd reach for it again under the same constraints.
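The post, object, reconcile loop above can be sketched in a few dozen lines. Everything here is illustrative: the `Blackboard` fields, the warehouse names, the capacities, and the objection threshold are assumptions, and the agents are plain functions where a real system would have model calls.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Blackboard:
    """Shared structured workspace; field names are illustrative."""
    plan: Optional[dict] = None
    objections: list = field(default_factory=list)
    revision: int = 0


OBJECTION_THRESHOLD = 2  # assumed value for "enough objections"


def planner(bb: Blackboard) -> bool:
    # Precondition: no plan has been posted yet.
    if bb.plan is not None:
        return False
    bb.plan = {"WH-1": 120, "WH-2": 80}  # pallets scheduled per warehouse
    return True


def make_warehouse_agent(name: str, capacity: int):
    def agent(bb: Blackboard) -> bool:
        # Precondition: a plan exists and this warehouse has not objected to this revision.
        if bb.plan is None:
            return False
        already = any(o["from"] == name and o["revision"] == bb.revision for o in bb.objections)
        if already or bb.plan.get(name, 0) <= capacity:
            return False
        bb.objections.append({"from": name, "revision": bb.revision, "capacity": capacity})
        return True
    return agent


def reconciler(bb: Blackboard) -> bool:
    # Precondition: enough objections against the current revision.
    current = [o for o in bb.objections if o["revision"] == bb.revision]
    if len(current) < OBJECTION_THRESHOLD:
        return False
    for o in current:
        bb.plan[o["from"]] = o["capacity"]  # clamp to what the warehouse says is feasible
    bb.revision += 1
    return True


def run(bb: Blackboard, agents, max_rounds: int = 10) -> Blackboard:
    # No coordinator picks who runs; every agent checks its own precondition each round.
    for _ in range(max_rounds):
        if not any([agent(bb) for agent in agents]):
            break  # quiescent: nobody had work to do
    return bb


bb = run(Blackboard(), [planner, make_warehouse_agent("WH-1", 100),
                        make_warehouse_agent("WH-2", 60), reconciler])
```

The loop stops when a full round passes with no agent acting, which is the blackboard equivalent of a termination rule.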
Sequential pipeline in the wild: sales-ops research at high-velocity B2B teams
The canonical sequential pipeline shipping in 2026 is the sales-ops research stack that's become standard at high-velocity B2B teams — Clay, Apollo's agent layer, the various CrewAI-templated stacks that the GTM-engineering crowd shares on LinkedIn every other week. The shape is always three agents in a fixed order: an account researcher that pulls firmographics and a recent-news scan, a signal hunter that scores buying-intent cues from job postings and tech-stack changes, and a briefer that turns the upstream output into a short, formatted memo for an account executive. It's a pipeline because the order is deterministic and the agents don't argue with each other; each one's output is the next one's input, and the only branching is a quality gate that kicks back to the researcher if the firmographic block is empty. CrewAI's `Process.sequential` was effectively designed for this shape, and it's the one place we don't push back on a multi-agent ask — the role split genuinely pays because the prompts and the tool sets really are different per step.
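Stripped of any framework, the three-step shape with its single quality gate looks roughly like this. The stage functions are stubs with made-up data (the first research attempt is hard-coded to come back empty so the gate has something to do); in a real stack each would wrap a model call with its own prompt and tools.

```python
def account_researcher(account: str, attempt: int) -> dict:
    # Stub: pretend the first lookup comes back empty and the retry succeeds.
    firmographics = {} if attempt == 0 else {"employees": 250, "industry": "logistics"}
    return {"account": account, "firmographics": firmographics}


def signal_hunter(research: dict) -> dict:
    # Stub scoring rule standing in for job-posting and tech-stack signals.
    score = 0.7 if research["firmographics"].get("employees", 0) > 100 else 0.2
    return {**research, "intent_score": score}


def briefer(signals: dict) -> str:
    return f"{signals['account']}: intent {signals['intent_score']:.1f}"


def run_pipeline(account: str, max_research_attempts: int = 2) -> str:
    research: dict = {}
    for attempt in range(max_research_attempts):
        research = account_researcher(account, attempt)
        if research["firmographics"]:  # quality gate: the only branch in the pipeline
            break
    else:
        raise RuntimeError(f"no firmographics for {account}")
    # Fixed order from here on: each stage's output is the next stage's input.
    return briefer(signal_hunter(research))


memo = run_pipeline("Acme Freight")
```

The retry count on the gate is bounded on purpose; an unbounded kick-back is how a pipeline quietly turns into a loop.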
Hybrid in the wild: deep-research products from Anthropic, OpenAI, and the open clones
Hybrid is the pattern that ships in every deep-research product you've seen this year. Anthropic's Research feature for Claude, OpenAI's Deep Research mode, Perplexity's Research tier, and the open clones (Anthropic's cookbook, the Open Deep Research repo, LangChain's reference template) all converge on the same supervisor-plus-swarm shape: a planner reads the user's question, decomposes it into a handful of independent sub-queries, fans those out to parallel research workers, and then runs a single writer agent over the gathered summaries. The reason every team lands here independently is structural — the planning and the writing are sequential and stateful and want a single agent, but the research step is embarrassingly parallel and wants a swarm, and forcing either step into the other's pattern wastes either latency or coherence. The trap, which we see in roughly a third of the open clones, is letting the workers also do follow-up planning; once that happens you've drifted into a fourth-pattern hybrid-of-hybrids and the trace becomes unreadable. Keep the supervisor's job and the swarm's job strictly separated and the shape holds up under production load.
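A minimal sketch of that plan, fan out, write shape, using `asyncio` from the standard library. The three stage functions are stubs we made up for illustration; the point is the control flow, where planning and writing run once and only the research step runs in parallel.

```python
import asyncio


async def plan(question: str) -> list[str]:
    # Stub planner: a real one would ask a model to decompose the question.
    return [f"{question} :: background", f"{question} :: recent changes", f"{question} :: open problems"]


async def research(sub_query: str) -> str:
    await asyncio.sleep(0.01)  # stands in for search and model latency
    return f"summary of [{sub_query}]"


async def write(question: str, summaries: list[str]) -> str:
    return f"{question}\n" + "\n".join(f"- {s}" for s in summaries)


async def deep_research(question: str, max_workers: int = 4) -> str:
    sub_queries = await plan(question)  # supervisor step: sequential
    limit = asyncio.Semaphore(max_workers)

    async def bounded(q: str) -> str:
        async with limit:
            return await research(q)  # workers return summaries and never re-plan

    # Swarm step: parallel, with results returned in the order the sub-queries were planned.
    summaries = await asyncio.gather(*(bounded(q) for q in sub_queries))
    return await write(question, list(summaries))  # writer step: sequential


report = asyncio.run(deep_research("blackboard systems"))
```

Keeping `research` free of any call back into `plan` is the code-level version of keeping the supervisor's job and the swarm's job separated.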
Multi agent system examples: what production teams are shipping in 2026
It's easier to argue about patterns when you can point at concrete builds. Here are six multi agent system examples drawn from the public literature, vendor case studies, and the engineering conversations our team has had this year. None are paiteq client outcomes; they're industry-shape references.
| Domain | Agent count | Framework | Pattern | Year shipped |
|---|---|---|---|---|
| Customer support triage | 4 (router, retriever, drafter, escalator) | LangGraph | Supervisor | 2024–2025 |
| Code generation (IDE agent) | 3 (planner, coder, tester) | Custom + OpenAI Agents SDK | Hierarchical | 2025 |
| Deep research | 5–8 (planner + parallel researchers + writer) | Anthropic Agent SDK | Supervisor + swarm hybrid | 2025 |
| Supply chain optimisation | 10+ (one per warehouse) | Custom on Temporal | Blackboard | Classic, refreshed 2024 |
| Sales-ops research | 3 (account researcher, signal hunter, briefer) | CrewAI | Sequential pipeline | 2024 |
| Trading-floor analysis | 4 (data, technical, fundamental, risk) | Custom + AutoGen | Supervisor | 2025 |
Two patterns to call out from this list. The deep-research stack (Anthropic's Claude research feature, and the open clones that followed) is the canonical case where swarm-style parallel exploration genuinely pays — the subtasks are independent, the writer agent only needs the summaries, and the wall-clock saving from running researchers in parallel is the whole product. The customer-support stack is the opposite: a tight supervisor with three or four specialised workers, where the supervisor's role is to keep context lean and route deterministically. The framework choice (LangGraph vs CrewAI vs the bare Agents SDKs) is much less interesting than the pattern choice.
Best multi agent system framework: a six-way decision matrix
There isn't a best multi agent system framework in the abstract. There's a best framework for a given pattern, team size, and observability requirement. The honest answer to the question we get most often (which framework should we use?) is a four-dimension judgement: latency cost, observability maturity, learning curve, and pattern fit.
| Framework | Latency cost | Observability | Learning curve | When to pick |
|---|---|---|---|---|
| LangGraph | Low overhead | Strong via LangSmith | Medium; state graph mental model | Supervisor or hierarchical with explicit state |
| CrewAI | Low; crew wrapper is thin | Improving; native traces in 1.x | Lowest; role-task vocabulary clicks fast | Sequential pipeline or small supervisor |
| AutoGen | Higher; group chat fans tokens | Medium; Studio helps but is research-grade | Medium; group chat is easy, beyond is not | Research prototypes, group chat patterns |
| OpenAI Agents SDK | Low if all-in on OpenAI | Tight with OpenAI traces UI | Lowest if your team is already OpenAI-native | Swarm and small supervisor inside OpenAI |
| Anthropic Agent SDK | Low; computer-use is the slow part | Growing; tool-call traces are first-class | Low; ergonomic Python | Claude-led supervisor with computer-use tools |
| Temporal as orchestrator | Higher control-plane cost | Industrial-strength; workflow history is gold | Highest; you're learning Temporal too | Long-running, multi-day workflows with humans |
Two notes on the matrix that don't fit in cells. First, the OpenAI and Anthropic Agent SDKs aren't strict alternatives to LangGraph or CrewAI; they're a layer below. You can wrap either SDK inside a LangGraph node and get the best of both. Most production systems end up as a thin LangGraph (or CrewAI) outer loop around vendor-SDK inner calls. Second, Temporal isn't really a multi-agent framework at all; it's a durable workflow runtime. When the workflow spans hours or days and has to survive worker crashes, picking Temporal and putting agent calls inside its activities is the right answer and the multi-agent question becomes a workflow design question instead.
Where LangGraph hurts. The pain point we hit most often on LangGraph builds is the typed-state object that's beautiful at sprint one and a refactor headache by sprint four. A typical engagement shape: the team starts with a clean `AgentState` TypedDict carrying a `messages` list and a `next` field, then adds a `scratchpad`, then a per-agent `memory` dict, then a `tool_outputs` cache to keep retries cheap, and within a month the state object is fifteen fields deep with implicit invariants nobody documented. LangGraph itself is fine (the graph compiles, the runtime is fast, the LangSmith traces are excellent), but the cost of changing the state shape grows much faster than the field count, because every node that touches the state has to be re-read and every saved checkpoint becomes a migration problem. Our default mitigation is a hard rule that the state object owns no more than seven fields, with everything else demoted to a side-cache keyed by run-id; we've watched teams who skipped that rule lose a sprint per quarter to state-shape refactors. If the state really needs to grow, that's a signal to split the graph in two, not to keep widening one object.
Where CrewAI saves a sprint. CrewAI's value shows up earliest on small sequential pipelines where the team is new to agents and the project lives or dies on time-to-first-demo. A typical-shape engagement: a four-person team wants to ship a sales-ops research pipeline in two weeks, doesn't have prior LangGraph muscle memory, and the work decomposes into three agents in a fixed order. CrewAI's role-task-crew vocabulary clicks in a half-day workshop, `Process.sequential` matches the pipeline pattern exactly, and the first full-loop run is usually green by day three. The trade-off lands later, when the team needs to add a branching decision (route to a 'deep research' sub-crew when the firmographic data is sparse), at which point CrewAI's process abstractions start to feel narrow and you find yourself reaching for a wrapper LangGraph anyway. So the recommendation we keep landing on is: start in CrewAI if the team is new and the pattern is genuinely sequential, plan the migration to LangGraph for the moment the graph stops being a line, and don't try to retrofit branching into a CrewAI Crew when a graph runtime is the right tool.
AutoGen edge cases. AutoGen earns its place in research-flavoured group-chat patterns, and it's the right tool for almost nothing else in 2026. The typical-engagement shape we'd point you toward AutoGen for is one where the value of the system is in agents arguing with each other — a red-team-vs-blue-team eval harness, a debate-style fact-checking loop, a research prototype where you want to watch four personas push back on a draft. The group-chat abstraction makes that easy and the Studio UI is genuinely the nicest one in this category for inspection. The places we've seen it break down are the ones where the team picked AutoGen for a production supervisor build because the demos looked clean, and then the group-chat token fan-out and the looser termination semantics started chewing through their token budget on every long conversation. If you find yourself adding custom termination logic on top of AutoGen, that's the moment to step back and ask whether LangGraph's explicit edges would have been the cheaper starting point; it almost always would have been.
Coordination cost is the unit-economics killer
A common mistake is to budget a multi-agent system as if its cost equals the sum of its agent calls. It doesn't. The supervisor sees every subagent's output, plus the original task, plus its own running plan, on every routing decision. By turn three the supervisor's input context is the sum of everything the subagents have said so far, and you're paying for that context on every supervisor turn. Add classifier-assisted routing (the supervisor asking a Haiku-class classifier whether to delegate) and you're paying twice on every decision.
The multipliers quoted here are typical-engagement-shape numbers we use to sanity-check a budget before the first prototype lands. The exact figure for any given system depends on prompt size and tool-call density and how aggressively the supervisor prunes context between turns. The shape, though, holds: anything beyond a sequential pipeline costs at least 2× the single-agent equivalent on tokens, and the cost grows roughly linearly in agent count plus an extra term from supervisor context growth. If your pricing model can't absorb a 4× to 6× multiplier on the agent-loop tokens, you don't have a multi-agent product yet; you have a research prototype that needs cost engineering before it ships.
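The "extra term from supervisor context growth" is easier to see as arithmetic. The sketch below is a deliberately crude model with assumed numbers (a 2k-token task, 3k-token subagent replies, 4k-token worker prompts, no summarisation); it ignores supervisor output tokens and per-model pricing, so treat the totals as shape and not forecast.

```python
def supervisor_input_tokens(task: int, per_reply: int, turns: int) -> int:
    # Routing decision k sees the task plus the k subagent replies so far.
    return sum(task + k * per_reply for k in range(turns + 1))


def run_tokens(task: int, per_reply: int, worker_input: int, turns: int) -> dict:
    workers = turns * (worker_input + per_reply)  # linear in the number of worker calls
    supervisor = supervisor_input_tokens(task, per_reply, turns)  # grows quadratically
    return {"workers": workers, "supervisor": supervisor, "total": workers + supervisor}


# Assumed numbers, for illustration only.
six = run_tokens(task=2_000, per_reply=3_000, worker_input=4_000, turns=6)
twelve = run_tokens(task=2_000, per_reply=3_000, worker_input=4_000, turns=12)

print(six)     # {'workers': 42000, 'supervisor': 77000, 'total': 119000}
print(twelve)  # {'workers': 84000, 'supervisor': 260000, 'total': 344000}
```

Doubling the turns doubles the worker spend but more than triples the supervisor spend, which is why the supervisor's context is the first thing to cap.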
The fix is mostly mechanical. Cap the supervisor's context with a rolling summary instead of the full history; force subagents to return structured JSON instead of free prose; pin the cheapest competent model per role (Haiku 4.5 routes, Sonnet 4.6 works, Opus 4.7 plans) using the same model selection patterns we use for LLM workloads; and add a hard turn cap so a runaway loop bails before it bankrupts the run. Every production multi-agent system we've shipped has these four levers turned up.
Multi agent system implementation: a working LangGraph supervisor
A pragmatic multi agent system implementation in 2026 leans on a small, boring toolchain. LangGraph 0.4.x or CrewAI 1.x on top of Python 3.11+. A vendor SDK underneath for the actual model calls (the Anthropic SDK for Claude Sonnet 4.6 or Opus 4.7, the OpenAI SDK for GPT-5). Langfuse or LangSmith for traces. Postgres for whatever shared state needs to outlive a single run. A queue (SQS, NATS, or an internal Temporal cluster) if the runs are long-lived. Nothing exotic, and nothing that the team can't operate.
# Supervisor + two specialists in LangGraph 0.4.x.
from typing import Annotated, Literal, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import SystemMessage
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages

planner = ChatAnthropic(model="claude-opus-4-7")
worker = ChatAnthropic(model="claude-sonnet-4-6")


class AgentState(TypedDict):
    messages: Annotated[list, add_messages]  # appended to, never overwritten
    next: str                                # the planner's latest routing decision


def supervisor(state: AgentState):
    # Planner decides which specialist runs next, or stops.
    # The system message goes first; appended after the history, ChatAnthropic may reject it.
    decision = planner.invoke(
        [SystemMessage(content="Reply with one word: researcher, writer, or done.")]
        + state["messages"]
    )
    return {"next": decision.content.strip().lower()}


def route(state: AgentState) -> Literal["researcher", "writer", "__end__"]:
    # Anything other than a known specialist name (including "done") ends the run.
    return {"researcher": "researcher", "writer": "writer"}.get(state["next"], END)


def researcher(state: AgentState):
    out = worker.invoke([SystemMessage(content="Research only.")] + state["messages"])
    return {"messages": [out]}


def writer(state: AgentState):
    out = worker.invoke([SystemMessage(content="Write a final answer.")] + state["messages"])
    return {"messages": [out]}


graph = StateGraph(AgentState)
graph.add_node("supervisor", supervisor)
graph.add_node("researcher", researcher)
graph.add_node("writer", writer)
graph.add_edge(START, "supervisor")
graph.add_conditional_edges("supervisor", route)
graph.add_edge("researcher", "supervisor")  # START cannot be an edge target, so loop back to the supervisor node
graph.add_edge("writer", END)
app = graph.compile()

# Equivalent crew in CrewAI 1.x.
from crewai import Agent, Crew, Process, Task
from crewai.llm import LLM

planner_llm = LLM(model="claude-opus-4-7")
worker_llm = LLM(model="claude-sonnet-4-6")

researcher = Agent(
    role="Researcher",
    goal="Gather facts for the writer",
    backstory="Senior research analyst.",
    llm=worker_llm,
)
writer = Agent(
    role="Writer",
    goal="Produce a tight final answer",
    backstory="Editor with a low tolerance for filler.",
    llm=worker_llm,
)

research_task = Task(
    # {question} is filled in from the inputs passed to kickoff().
    description="Research the user's question: {question}",
    expected_output="Bullet list of cited facts.",
    agent=researcher,
)
write_task = Task(
    description="Write the final reply using the researcher's bullets.",
    expected_output="A 200-word answer.",
    agent=writer,
    context=[research_task],  # the writer sees the researcher's output
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.hierarchical,  # a manager model delegates, playing the supervisor role
    manager_llm=planner_llm,
)
result = crew.kickoff(inputs={"question": "What changed in multi-agent systems in 2026?"})

# Same supervisor shape on the Anthropic SDK.
# Hand-rolled routing loop on the Messages API: each role is a model plus a system prompt.
# Tool wiring (the researcher's web search) is left out to keep the loop readable.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

ROLES = {
    "planner": ("claude-opus-4-7", "Reply with one of: researcher, writer, done."),
    "researcher": ("claude-sonnet-4-6", "Research only. Cite sources."),
    "writer": ("claude-sonnet-4-6", "Write a tight 200-word answer."),
}


def run(role: str, state: dict) -> str:
    model, system = ROLES[role]
    notes = "\n\n".join(f"[{h['agent']}] {h['out']}" for h in state["history"])
    resp = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": f"Question: {state['q']}\n\n{notes}"}],
    )
    return resp.content[0].text


MAX_TURNS = 12
state = {"q": "...", "history": []}
for turn in range(MAX_TURNS):
    d = run("planner", state).strip().lower()
    if d not in ("researcher", "writer"):  # "done" or an unexpected reply ends the run
        break
    state["history"].append({"agent": d, "out": run(d, state)})

Three implementation traps we see repeatedly. First, hardcoded role prompts that don't degrade gracefully when the planner picks an unexpected route; always include a 'no-op' branch that just appends a short refusal to state and lets the supervisor decide again. Second, missing turn caps; we put a hard MAX_TURNS in the supervisor's routing function and refuse to ship without it. Third, untyped state: LangGraph's TypedDict isn't optional in our book, because once you're three commits in, you will forget what fields you added and what they meant, and the type checker is the only thing that catches it before the first 3am page.
On the training-and-serving split: agent systems don't have a training phase the way a classifier does, but they do have an eval phase that behaves like one. We treat the prompt set, tool list, and routing logic as 'weights' and keep them in version control with the same discipline a model gets, run a regression suite on every change, and only promote to production when the eval grid passes. Teams that skip this end up reverting prompt changes by hand at 3am when the support backlog spikes, which is a recoverable mistake the first time and a culture problem by the third. We've watched that pattern more often than we'd like.
What's missing from these snippets. The three code samples above are honest sketches of the routing loop, and they're also missing roughly two-thirds of what a production multi-agent system actually needs. None of them persist state across runs; each invocation starts from a fresh dict, which is fine for a notebook demo and unworkable the moment a user can resume a session. In production you'd back the state object with Postgres (a `state_snapshots` table keyed by run-id, written after every node) or wrap the whole graph in a Temporal workflow so the runtime owns the durability. None of them handle errors gracefully; a tool that raises mid-run blows up the loop instead of being caught, logged, and either retried with backoff or surfaced to the supervisor as a structured failure for re-planning. We default to a try-and-record pattern at every tool wrapper: catch, log a structured failure event with run-id and agent-id, and return a typed error object the supervisor can read and route on.
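A try-and-record wrapper of the kind described above could be sketched like this. The `ToolOk` and `ToolError` types, the `safe_tool` helper, and the choice of which exceptions count as retryable are all our assumptions for illustration, not part of any framework.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Union

log = logging.getLogger("agent.tools")


@dataclass
class ToolOk:
    value: Any


@dataclass
class ToolError:
    tool: str
    kind: str        # exception class name, so the supervisor can route on it
    message: str
    retryable: bool


def safe_tool(fn: Callable[..., Any], *, run_id: str, agent_id: str) -> Callable[..., Union[ToolOk, ToolError]]:
    def wrapper(*args, **kwargs):
        try:
            return ToolOk(fn(*args, **kwargs))
        except Exception as exc:
            # Record a structured failure event instead of letting the loop blow up.
            log.warning("tool_failed", extra={"run_id": run_id, "agent_id": agent_id,
                                              "tool": fn.__name__, "error": repr(exc)})
            return ToolError(
                tool=fn.__name__,
                kind=type(exc).__name__,
                message=str(exc),
                retryable=isinstance(exc, (TimeoutError, ConnectionError)),
            )
    return wrapper


def flaky_lookup(order_id: str) -> dict:
    # Hypothetical tool that always times out, to exercise the failure path.
    raise TimeoutError("inventory service timed out")


result = safe_tool(flaky_lookup, run_id="run-1", agent_id="researcher")("A-42")
```

Because the wrapper returns a value instead of raising, the supervisor can read `result.kind` and `result.retryable` and decide whether to retry, re-plan, or give up.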
Retries deserve their own line. The snippets above retry nothing, which means a transient 429 from the model provider or a flaky tool call ends a run that would have succeeded on the next attempt. We wrap every model call and every tool call in an exponential-backoff helper (tenacity in Python, p-retry in Node) with a strict cap — three attempts, capped at 30 seconds total — and we record every retry as a separate trace span so the dashboards show retry rate as a first-class metric. Two more things missing from the sketches: there's no idempotency token on the tool calls (so a retried 'send email' tool can double-send), and there's no cost meter wired into the loop (so a runaway can outrun your alerting). The supervisor.py file in production is usually three times the size of the sketch above, and the extra lines are exactly state persistence, error handling, retries, idempotency, and cost metering. None of it is fancy; all of it is non-negotiable before you ship.
The five production failure modes nobody warns you about
Demo-stage multi-agent systems break in ways their authors didn't anticipate. Five failure modes account for the majority of incidents on the systems we audit.
One. Token-budget runaway. A supervisor that doesn't summarise sees its context grow with every turn, and because each turn re-sends that whole context, cumulative input tokens grow faster than linearly; on a bad day, a single user query can run up a six-figure input-token count in one run. A concrete trace shape we've seen on an audit: turn 1 input is 2k tokens, turn 6 input is 38k tokens, turn 12 input is 110k tokens, and the supervisor still hasn't decided to stop. The cost curve in the trace UI looks like a hockey stick. Detection: alert on per-run token count exceeding a fixed budget (we typically set this at 5× the median run for the workload). Prevention snippet: after every supervisor turn, replace the oldest N messages with a one-paragraph summary block tagged `summary_of_turns_1_to_5`, and impose a hard MAX_TURNS = 12 in the routing function that returns 'done' on hit. The combination cuts the 99th-percentile run cost by an order of magnitude with almost no quality loss.
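As a rough sketch of those two levers working together: the `summarise` function below is a placeholder for a cheap model call, `KEEP_RECENT` is an assumed window size, and the router is stubbed to never say "done" so the cap has to do the stopping.

```python
MAX_TURNS = 12
KEEP_RECENT = 4  # assumed window; tune per workload


def summarise(messages: list[dict]) -> str:
    # Placeholder: a real implementation would call a cheap model here.
    return " | ".join(m["content"][:40] for m in messages)


def compact(history: list[dict]) -> list[dict]:
    """Fold everything except the most recent messages into one summary block."""
    if len(history) <= KEEP_RECENT + 1:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    block = {"role": "user", "tag": f"summary_of_{len(old)}_older_messages",
             "content": summarise(old)}
    return [block] + recent


def route(history: list[dict], turn: int, decide) -> str:
    # The hard cap lives in the routing function, so no caller can forget it.
    if turn >= MAX_TURNS:
        return "done"
    return decide(history)


history: list[dict] = []
turns_run = 0
for turn in range(100):  # deliberately larger than MAX_TURNS
    if route(history, turn, decide=lambda h: "researcher") == "done":
        break
    history.append({"role": "assistant", "content": f"finding from turn {turn}"})
    history = compact(history)
    turns_run += 1
```

With both in place the supervisor's input stays at one summary block plus a fixed number of recent messages, however long the run would otherwise have gone.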
Two. Role collision. Two agents with overlapping prompts converge on the same answer style and stop adding value. The trace signature is unmistakable once you've seen it: agent A returns a 400-word answer, agent B returns a near-identical 410-word answer with slightly different phrasing, and the supervisor's aggregator concatenates both and ships 800 words of redundant prose to the user. We've watched this in a fact-checker-and-editor pair where both agents drifted into 'comprehensive review' over time. Detection: a cheap eval that asks a Haiku-class judge 'did agent B's output strictly subsume agent A's?'; if the answer is yes on more than a small fraction of cases, the roles are too close. Prevention snippet: write each role contract as a one-sentence 'must do X and must not do Y' constraint inside the system prompt, and add an automated diff check post-run that flags high token-overlap between supposed-to-be-different agents.
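The automated diff check can start as something very simple. This sketch uses Jaccard similarity over word sets, which is our choice for illustration (the text above does not prescribe a metric), and the 0.6 threshold is an assumption to calibrate against real traces.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Jaccard similarity over word sets: 0.0 is disjoint, 1.0 is identical vocabulary."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


OVERLAP_THRESHOLD = 0.6  # assumed cut-off; calibrate on your own traces


def role_collisions(outputs: dict[str, str]) -> list[tuple[str, str, float]]:
    names = sorted(outputs)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = overlap(outputs[a], outputs[b])
            if score >= OVERLAP_THRESHOLD:
                flagged.append((a, b, round(score, 2)))
    return flagged


flagged = role_collisions({
    "fact_checker": "The claim about 2019 revenue is wrong and the source link is dead.",
    "editor": "The claim about 2019 revenue is wrong and the source link is dead, overall.",
    "researcher": "Three filings mention the acquisition; none give a price.",
})
print(flagged)  # [('editor', 'fact_checker', 0.92)]
```

A word-set metric misses paraphrased duplicates, so it complements the model-judge eval and does not replace it.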
Three. Context-window OOM. The supervisor's input grows past the model's window and the system silently truncates, often dropping the earliest task description. The trace example we still cite in workshops: a supervisor on a 200k-token-window model passes 205k tokens of context on turn nine, the API quietly drops the first 5k tokens (which happened to include the user's original instructions), the system produces a confident but off-topic answer, and the user reports a 'hallucination' that's actually just the original prompt being silently forgotten. Detection: log every model call's input token count alongside the model's context limit, and alert when usage crosses four-fifths of the window. Prevention snippet: pin the supervisor to the longest-context model available (Claude Opus 4.7 at 1M, Gemini 3.0 Pro at 2M), summarise older turns aggressively (drop everything older than the last 8 turns into a single summary block), and re-inject the original task description at the head of every supervisor input as a non-negotiable system message.
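A sketch of the detection and prevention steps follows. The four-characters-per-token estimate is a crude assumption (a production check would use the provider's token counter), and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContextCheck:
    used: int
    limit: int
    ratio: float
    alert: bool


def estimate_tokens(text: str) -> int:
    # Crude 4-characters-per-token heuristic; use the provider's token counter in production.
    return max(1, len(text) // 4)


def check_context(messages: list[str], limit: int, alert_at: float = 0.8) -> ContextCheck:
    used = sum(estimate_tokens(m) for m in messages)
    ratio = used / limit
    return ContextCheck(used=used, limit=limit, ratio=ratio, alert=ratio >= alert_at)


def build_supervisor_input(task: str, summary: str, recent: list[str], keep_last: int = 8) -> list[str]:
    # The original task always goes first, so trimming older turns can never drop it.
    return [f"TASK: {task}", f"SUMMARY: {summary}"] + recent[-keep_last:]


turns = [f"turn {i}: " + "x" * 400 for i in range(20)]
messages = build_supervisor_input("Compare the three vendors on price.",
                                  "earlier turns summarised", turns)
check = check_context(messages, limit=1_000)  # tiny limit so the example trips the alert
```

Logging `check.used` and `check.limit` on every model call gives you the four-fifths alert for free, and the fixed head of the input means a truncation bug shows up as a missing recent turn, not a forgotten task.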
Four. Untraceable tool calls. An agent calls a tool whose effect isn't logged in the conversation buffer, and a later audit can't reconstruct what happened. The trace shape that gives this away: the conversation buffer shows a clean response, but a downstream system (Salesforce, Stripe, a CRM webhook) shows a write that the trace can't account for. We've watched a team spend a full sprint reconstructing what a five-agent run did when finance flagged a four-figure vendor charge with no corresponding trace span. Detection: every tool call must emit a structured log line with run-id, agent-id, tool-name, arguments, and outcome, and a nightly job should reconcile tool-side events against trace-side events and alert on the delta. Prevention snippet: instrument the tool registry once, at the wrapper layer (`@traced_tool` decorator that wraps every tool function and emits an OTel span on entry and exit), instead of asking every agent to be polite about logging. The investment is half a sprint and pays for itself the first time legal asks what the system did on a specific request.
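Here is one way the `@traced_tool` wrapper might look. To keep the sketch dependency-free it emits a structured JSON log line through the standard `logging` module; an OTel span would open and close at the same two points. The `run_id` and `agent_id` keyword arguments and the `charge_vendor` tool are hypothetical names for illustration.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("tool_trace")


def traced_tool(func):
    """Wrap a tool so every call emits one structured log line."""
    @functools.wraps(func)
    def wrapper(*args, run_id: str, agent_id: str, **kwargs):
        record = {
            "run_id": run_id,
            "agent_id": agent_id,
            "tool_name": func.__name__,
            "arguments": {"args": list(args), "kwargs": kwargs},
            "started_at": time.time(),
        }
        try:
            result = func(*args, **kwargs)
            record["outcome"] = "ok"
            return result
        except Exception as exc:
            record["outcome"] = f"error: {type(exc).__name__}"
            raise
        finally:
            # Runs on both success and failure, so no call goes unlogged.
            record["duration_s"] = time.time() - record["started_at"]
            logger.info(json.dumps(record, default=str))
    return wrapper


@traced_tool
def charge_vendor(vendor: str, amount_cents: int) -> str:
    """Hypothetical tool with a side effect worth auditing."""
    return f"charged {vendor} {amount_cents}"
```

A call then looks like `charge_vendor("acme", 1200, run_id="r1", agent_id="billing")`. The logger needs to be configured at INFO level or lower for the lines to appear.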
Observability and eval — the agent-trace stack
An agent system without a trace stack is undebuggable in production. The three tools we lean on, in order of how often they show up in our engagements: Langfuse for self-hosted or hybrid teams that want full ownership of trace data, LangSmith for teams already on the LangChain ecosystem, and Inspect AI from the UK AI Security Institute (AISI) when the eval suite needs to be defensible to a regulator. All three answer the same three questions: which agent ran when, what did each one see and emit, and how long did each step take.
The eval harness is the part most teams underbuild. A multi-agent eval suite has to cover three layers. Per-agent unit evals (does the researcher cite real sources; does the writer hit the format spec). Per-handoff evals (does the supervisor route to the right specialist given a known input). And full-system evals (does the whole graph produce a correct final answer on a held-out test set). We typically maintain 50 to 200 cases per layer and re-run them on every prompt change, which sounds expensive and isn't, because most of the cost is one-time fixture authoring and the runs themselves take minutes on a small batch.
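A sketch of how the three layers can share one runner. The `EvalCase` shape and the layer names are our assumptions; in practice `run` would call an agent, the supervisor's router, or the whole graph.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    layer: str                       # "agent", "handoff", or "system"
    name: str
    run: Callable[[], object]        # produces the output under test
    check: Callable[[object], bool]  # returns True on pass


def run_suite(cases: list[EvalCase]) -> dict[str, float]:
    """Return the pass rate per layer; a crash counts as a failure."""
    passed: dict[str, int] = {}
    total: dict[str, int] = {}
    for case in cases:
        total[case.layer] = total.get(case.layer, 0) + 1
        try:
            ok = bool(case.check(case.run()))
        except Exception:
            ok = False
        passed[case.layer] = passed.get(case.layer, 0) + int(ok)
    return {layer: passed[layer] / total[layer] for layer in total}
```

Reporting per layer matters: a drop in the handoff pass rate with a steady per-agent rate points at the routing prompt, not the specialists.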
Observability tooling pairs naturally with the kind of workflow automation engagements where we run the orchestration ourselves; you don't get to run an agent system in production without a trace UI, and you don't get to ship a workflow automation product without one either.
There's a second, quieter discipline that makes multi-agent systems serviceable in production: structured logging at the tool boundary. Every external call an agent makes should land in your logs as a structured event with a run-id and an agent-id and a timestamp. We've watched teams skip this and then spend a full sprint reconstructing what a five-agent run did when a downstream system flagged a bad write. The wrapper that emits these logs is half a day of work; the absence of it is two weeks of forensic spelunking. The math isn't subtle.
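Once those structured events exist, reconciling them against what a downstream system recorded reduces to a set difference. This sketch assumes each tool call carries a `request_id` that both the trace log and the downstream system store; that field is hypothetical and depends on what your tools can propagate.

```python
def reconcile(tool_side_events: list[dict], trace_side_events: list[dict]) -> list[dict]:
    """Return downstream writes that no trace-side event accounts for."""
    traced_ids = {event["request_id"] for event in trace_side_events}
    return [
        event for event in tool_side_events
        if event["request_id"] not in traced_ids
    ]
```

Anything this returns is a write the agents made, or someone made on their behalf, that the trace cannot explain, which is the delta worth alerting on.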
On eval cadence: we recommend running the full suite on every prompt change and on every framework upgrade, with a weekly canary on production traffic. Production canaries don't need to be perfect, but they catch the kind of drift that hand-rolled fixtures miss. Two failure modes only show up under real traffic: long-tail input formats that the test set didn't capture, and emergent supervisor behaviours that only appear when context-window pressure is real. A weekly canary across 50 to 100 real (anonymised) inputs catches both. We've yet to meet a team that regrets adding it; we've met several who regret not adding it sooner.
Langfuse: the open-core default for self-hosted teams
Langfuse is the trace tool we pick most often, and the reason is unsexy: it's open-core, it self-hosts on infrastructure the client already owns (a single Postgres plus a small worker on older releases; v3 adds ClickHouse, Redis, and object storage), and the data stays there. The UI gives you the agent tree we keep showing in workshops (parent span for the supervisor, child spans for each subagent call, grandchild spans for each tool invocation, with token counts and latency on every span), and the SDK works across LangGraph, CrewAI, the OpenAI Agents SDK, the Anthropic Agent SDK, and raw model calls with the same instrumentation pattern. The weak spot is the eval product: Langfuse's eval harness is competent but not opinionated, and teams that want a strong opinionated workflow often layer DeepEval or promptfoo on top. Pick Langfuse when data residency matters, when the team wants to own the trace store, or when the stack is multi-vendor and you don't want to bet the observability layer on one model provider's tooling.
LangSmith: the LangChain-native pick when the team is all-in on the ecosystem
LangSmith is what we reach for when the team is already deep in LangChain and LangGraph. The integration is essentially zero-config — you set two environment variables and every node in your graph emits a span — and the dataset-and-eval product is the most opinionated of the three, with first-class support for LLM-as-judge evals, regression test suites, and prompt-version diffing inside the same UI as the traces. The weak spots are predictable: LangSmith is a hosted SaaS by default (self-hosting is a paid enterprise tier), pricing scales with trace volume which can sting on a busy production system, and the tightest features assume you're using LangChain abstractions throughout the stack. If the team isn't already on LangChain, the integration is fine but you're paying for features you won't fully use. Pick LangSmith when LangChain is the substrate, when the eval-driven-development discipline matters, and when the team values speed-of-setup over data ownership.
Inspect AI: the eval framework regulators read
Inspect AI is the odd one out and the one we recommend when the eval results have to defend themselves to a third party. Built and maintained by the UK AI Security Institute (formerly the AI Safety Institute), it's an open-source Python framework whose primary job isn't tracing your production traffic; it's running rigorous, reproducible eval suites of the kind that show up in regulatory submissions and frontier-model evaluations. The trace UI is functional rather than slick, but the eval primitives (solvers, scorers, datasets, plans) are the most carefully thought-out in this category, and the audit-trail format is taken seriously enough that we've seen it referenced directly in AISI and NIST submissions. Pick Inspect AI when the system is high-stakes enough that a regulator or a procurement team will ask 'show us your eval methodology', and pair it with one of the other two for day-to-day production tracing rather than asking it to be both.
When NOT to use a multi-agent system
Three opinionated calls we'll defend in any client review. They aren't always popular in the kickoff workshop. They've been right often enough that we keep making them.
Call one. Default to a single agent with parallel tool calls. Roughly six or seven out of every ten use cases that walk through our door with a multi-agent ask are better served by one well-prompted Claude Sonnet 4.6 or GPT-5 agent that can call its tools in parallel. The architecture is simpler, the cost is lower, the trace is one tree instead of a graph, and the failure modes are the ones every team already knows how to handle. If a single agent with the right tool set can produce the answer in one or two turns, splitting the work across agents is yak-shaving with extra bills.
Call two. Reach for multi-agent when the work decomposes into specialist roles that genuinely use different prompts, models, or tool sets. A researcher that runs on Sonnet with web-search tools, a fact-checker that runs on Opus with no tools, and a writer that runs on Sonnet with a style guide is a system where the role split pays. A 'planner agent' and a 'coder agent' that both run on the same model with the same tools and a slightly different system prompt is not.
Call three, the one teams underweight. Ship a single-agent baseline first, then decompose. Premature multi-agent decomposition is the new premature microservices: it costs you simplicity up front, locks in a topology before you've measured the bottleneck, and makes the next refactor harder. The right sequence is: ship single-agent, instrument it, find the actual quality or latency gap, split only where the measurement says splitting helps.
Multi agent system guide: a 7-step build checklist
Use this short checklist as a sanity check before you spend the first sprint on a multi-agent build. We run a version of it inside every kickoff workshop.
Step by step. One, write the task in one sentence; if you can't, the system isn't ready to be designed. Two, ship a single-agent baseline with parallel tool calls and instrument it; only proceed if you can name a measured gap. Three, pick the orchestration pattern (supervisor for most cases, swarm only for parallel research workloads, pipeline only for deterministic chains). Four, pick the framework that matches the pattern and your team's observability needs. Five, pin the cheapest competent model per role. Six, wire up Langfuse or LangSmith on day one. Seven, set hard caps on tokens per run and turns per run and dollars per day, and refuse to ship without them.
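Step seven is the one most worth writing down as code before anything else ships. A minimal sketch, assuming the orchestrator calls `record_turn` after every model call and that the day's running spend comes from your own billing counter (class and method names are ours):

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when any hard cap trips; the orchestrator should stop the run."""


@dataclass
class RunBudget:
    max_tokens_per_run: int
    max_turns_per_run: int
    max_dollars_per_day: float
    tokens_used: int = 0
    turns_used: int = 0

    def record_turn(self, tokens: int, dollars_spent_today: float) -> None:
        """Account for one turn and raise if any cap is exceeded."""
        self.tokens_used += tokens
        self.turns_used += 1
        if self.tokens_used > self.max_tokens_per_run:
            raise BudgetExceeded(f"token cap hit: {self.tokens_used}")
        if self.turns_used > self.max_turns_per_run:
            raise BudgetExceeded(f"turn cap hit: {self.turns_used}")
        if dollars_spent_today > self.max_dollars_per_day:
            raise BudgetExceeded(f"daily dollar cap hit: {dollars_spent_today:.2f}")
```

Raising an exception, instead of returning a flag, makes the cap hard to ignore: a caller has to catch it explicitly to keep going.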
One closing note on the checklist. Steps three through five almost always change after the team has run the single-agent baseline for two weeks; steps one and two rarely do. If you get the task definition and the baseline right, the rest of the design is recoverable. If you skip them, no amount of framework polish saves you. We've seen the same lesson land in three different industries this year, and we expect to see it again — every robust multi agent system we've shipped started from a single-agent baseline that earned its split, not a topology picked in the kickoff workshop.
Frequently asked questions about multi-agent systems
Is a multi-agent system always better than a single agent?
No, and it's usually worse for the task. A well-prompted single agent with parallel tool calls handles roughly six or seven out of every ten use cases teams ask us to scope. Multi-agent earns its place when the work decomposes into roles that genuinely use different prompts or different models or different tool sets. If yours doesn't, you're paying a 2× to 6× token premium for nothing.
Which orchestration pattern should I start with?
Supervisor. One coordinator routing to N specialist subagents covers roughly four out of every five production builds we see. Reach for swarm only when the work is genuinely parallel and idempotent (deep research is the canonical case). Reach for hierarchical only when one supervisor's context window can't hold the full plan. Reach for blackboard when agents are different services owned by different teams.
LangGraph, CrewAI, or AutoGen — which framework should I pick?
LangGraph for explicit-state supervisor builds with strong observability needs. CrewAI for fast prototyping and small sequential pipelines, especially when the team is new to agents. AutoGen for research-flavoured group-chat patterns. If your team is already all-in on OpenAI or Anthropic, the vendor SDK underneath is usually the better fit for the inner loop; wrap it in LangGraph or CrewAI on the outside.
How do I budget for a multi-agent system in production?
Start from the single-agent token cost as a 1.0× baseline and apply a typical-shape multiplier per pattern: 2× to 2.5× for a sequential pipeline, 3× to 4× for a supervisor with three subagents, 5× to 7× for a supervisor with five subagents or a small hierarchy. Then add the supervisor's context-growth term, which is roughly proportional to turn count. If your pricing can't absorb that, redesign before you ship.
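As a worked version of that arithmetic, here is a small estimator. The multipliers are the ranges quoted above; the per-turn growth figure is something you have to measure from your own traces, so it is a parameter here, not a constant.

```python
# Typical-shape multipliers from the text, as (low, high) ranges.
PATTERN_MULTIPLIERS = {
    "sequential_pipeline": (2.0, 2.5),
    "supervisor_3_subagents": (3.0, 4.0),
    "supervisor_5_subagents_or_small_hierarchy": (5.0, 7.0),
}


def estimate_run_tokens(
    baseline_tokens: int,
    pattern: str,
    turns: int,
    growth_tokens_per_turn: int,
) -> tuple[float, float]:
    """Low and high token estimates for one run of the given pattern."""
    low, high = PATTERN_MULTIPLIERS[pattern]
    growth = growth_tokens_per_turn * turns  # context-growth term, linear in turns
    return (baseline_tokens * low + growth, baseline_tokens * high + growth)
```

For example, a 10,000-token single-agent baseline run as a three-subagent supervisor over six turns, with an assumed 1,000 tokens of context growth per turn, lands between 36,000 and 46,000 tokens per run.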
What's the worst failure mode in a multi-agent system?
The infinite handoff loop. Two peers in a swarm keep disagreeing and handing the task back; without a hard turn cap, the loop runs until your alert fires. The fix is mechanical: count handoffs per run, cap at a known threshold (we use 12 as a generic ceiling), add a third-agent arbiter, and return a best-so-far answer with a degraded-confidence flag when the cap trips.
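A sketch of that mechanical fix, assuming the orchestrator exposes each handoff as a `step` callable that returns a draft answer and a finished flag. The `step` and `arbiter` signatures are ours, chosen to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_HANDOFFS = 12  # generic ceiling from the text


@dataclass
class RunResult:
    answer: Optional[str]
    degraded_confidence: bool
    handoffs: int


def run_with_handoff_cap(
    step: Callable[[int], tuple[Optional[str], bool]],
    arbiter: Optional[Callable[[list[str]], str]] = None,
    max_handoffs: int = MAX_HANDOFFS,
) -> RunResult:
    """Run handoffs until one is final or the cap trips."""
    drafts: list[str] = []
    for handoffs in range(max_handoffs):
        draft, is_final = step(handoffs)
        if draft is not None:
            drafts.append(draft)
        if is_final:
            return RunResult(draft, False, handoffs)
    # Cap tripped: let the arbiter choose among drafts, else take the latest.
    if drafts and arbiter is not None:
        answer = arbiter(drafts)
    else:
        answer = drafts[-1] if drafts else None
    return RunResult(answer, True, max_handoffs)
```

The degraded-confidence flag is what lets the calling product decide whether to show the answer, retry, or escalate to a human.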
Do I need a separate trace tool for multi-agent systems?
Yes. A flat LLM-call log doesn't tell you which agent ran when, what each one saw, and how the supervisor decided to route. Langfuse, LangSmith, and Inspect AI all give you the tree view a multi agent system needs. Wire one of them up on day one; debugging a five-agent system without traces in production is a sprint-burner the first time and a culture problem by the second.