Why Multi-Agent Systems Outperform Single Agents
A single LLM agent trying to handle every aspect of a complex task is like a one-person startup trying to do sales, engineering, design, and support simultaneously. It can work for simple tasks, but quality degrades as complexity increases. The agent's instructions become bloated, tools compete for attention, and the model struggles to maintain focus across diverse responsibilities.
Multi-agent systems solve this through specialization. Each agent has a narrow focus: one agent researches, another writes, a third reviews, and a fourth handles formatting. Each agent's instructions are concise and specific, its tool set is curated for its role, and it can be individually optimized and tested.
The empirical evidence is compelling. Research from multiple teams shows that multi-agent systems consistently outperform single agents on complex tasks. A study on code generation found that a three-agent system (planner, coder, reviewer) produced code with 30% fewer bugs than a single agent with all three capabilities. The reason is not that multi-agent systems use smarter models; they use the same models more effectively by reducing the cognitive load on each agent.
Multi-agent systems also provide natural error correction. When a reviewer agent evaluates a writer agent's output, it catches errors that the writer missed. This adversarial dynamic improves output quality without requiring more capable models. It mirrors how human teams work: peer review catches mistakes that self-review misses.
The challenge is orchestration complexity. Coordinating multiple agents requires careful design of communication patterns, shared state management, and error handling. A poorly designed multi-agent system can be worse than a single agent if agents conflict, loop endlessly, or lose context. The patterns in this tutorial help you avoid these pitfalls.
Multi-Agent Architecture Patterns
Four architecture patterns cover the majority of multi-agent use cases. The supervisor pattern has a central orchestrator agent that delegates tasks to worker agents based on the current need. The supervisor reads the conversation, decides which worker should handle the next step, passes context, collects the result, and either delegates to another worker or responds to the user. This is the most common pattern and works well for customer service, research assistants, and general-purpose applications.
The sequential pipeline pattern arranges agents in a fixed order, like an assembly line. Each agent processes the output of the previous one. Content generation is a natural fit: research agent produces findings, writer agent drafts the article, editor agent polishes the prose, and fact-checker agent verifies claims. The simplicity of this pattern makes it easy to debug and monitor.
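The assembly-line flow described above can be sketched as plain Python functions chained over a shared dict. The stage names (`research`, `write`, `edit`) and string outputs are illustrative stand-ins, not a framework API; a real system would call an LLM inside each stage.

```python
def research(state: dict) -> dict:
    # In a real pipeline this stage would call an LLM with search tools.
    state["findings"] = f"findings for: {state['topic']}"
    return state

def write(state: dict) -> dict:
    # Drafts the article from the researcher's output.
    state["draft"] = f"article based on {state['findings']}"
    return state

def edit(state: dict) -> dict:
    # Polishes the draft; here just a trivial transformation.
    state["final"] = state["draft"].capitalize()
    return state

PIPELINE = [research, write, edit]

def run_pipeline(topic: str) -> dict:
    state = {"topic": topic}
    for stage in PIPELINE:
        state = stage(state)  # each agent consumes the previous agent's output
    return state
```

Because the order is fixed, you can log the state after each stage and pinpoint exactly where quality degrades, which is what makes this pattern easy to debug.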
The debate or adversarial pattern has agents that argue different positions before reaching a consensus. A proposer agent generates a solution, a critic agent identifies weaknesses, and the proposer revises based on the critique. This back-and-forth continues until the critic is satisfied or a maximum number of rounds is reached. This pattern excels for decision-making tasks, code review, and analysis where thoroughness matters.
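The control flow of the proposer/critic loop, reduced to its skeleton: revise until the critic returns no objection or the round budget runs out. The callable signatures here are assumptions made for illustration; in practice each would wrap an LLM call.

```python
def debate(proposer, critic, max_rounds: int = 3):
    """Run proposer/critic rounds until the critic approves or rounds run out.

    proposer(critique) returns a solution (critique is None on the first call);
    critic(solution) returns None when satisfied, else a critique string.
    """
    solution = proposer(None)
    for round_no in range(max_rounds):
        critique = critic(solution)
        if critique is None:           # critic satisfied: stop early
            return solution, round_no
        solution = proposer(critique)  # revise using the critique
    return solution, max_rounds        # budget exhausted: return best effort
```

The explicit `max_rounds` cap is what keeps the adversarial loop from running forever when the critic is never satisfied.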
The hierarchical pattern nests teams within teams. A top-level supervisor delegates to mid-level supervisors, each of which manages a team of workers. This scales to complex workflows like building a software feature: the project manager agent delegates to a design team (UI agent, UX agent) and a development team (frontend agent, backend agent, testing agent). Each sub-team can use a different pattern internally.
Choose the simplest pattern that fits your use case. Over-engineering the architecture adds complexity without proportional benefit. Start with a supervisor pattern and graduate to hierarchical only when the supervisor's decision space becomes too large.
Shared State and Agent Communication
How agents share information is the most critical design decision in a multi-agent system. The two approaches are message-passing and shared state, each with distinct tradeoffs.
Message-passing sends the conversation history or a summary to each agent. When the supervisor delegates to a worker, it passes relevant messages. The worker responds, and the supervisor incorporates the response into the conversation. This is the default approach in both LangGraph (via the messages list in state) and the OpenAI Agents SDK (via handoffs). The advantage is simplicity; the disadvantage is that context grows with each agent interaction, potentially exceeding the context window.
Shared state uses a structured state object that all agents can read and write. In LangGraph, this is the TypedDict state of the graph. You define fields for each type of information: research_results, draft_content, review_feedback, and final_output. Each agent reads the fields it needs and writes its output to the appropriate field. This is more efficient than message-passing because agents only access relevant information, but it requires careful state schema design.
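A schema sketch for the fields named above. The class name `ArticleState` is an assumption; the field names come straight from the text. The `Annotated` reducer on `messages` tells LangGraph to append concurrent updates rather than overwrite them, while the plain string fields simply take the latest value written.

```python
from typing import Annotated, TypedDict
import operator

class ArticleState(TypedDict):
    # Append-only coordination log: operator.add merges concurrent updates.
    messages: Annotated[list, operator.add]
    # Structured intermediate results, each owned by one agent.
    research_results: str
    draft_content: str
    review_feedback: str
    final_output: str
```

Each agent node returns a partial update touching only its own field, which is what keeps agents from trampling each other's output.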
For production systems, a hybrid approach works best. Use shared state for structured intermediate results and message-passing for natural language coordination between agents. The supervisor agent uses messages to communicate with workers, while workers write their structured outputs to shared state fields that other workers can access directly.
Memory management is crucial as conversations grow. Implement summarization: periodically summarize older messages into a compact form and discard the originals. Use a sliding window that keeps the last N messages in full while summarizing everything before. For multi-session applications, persist the shared state to a database between sessions and load relevant portions when the conversation resumes.
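A minimal sketch of the sliding-window approach, assuming messages are plain strings. A production system would generate the summary with an LLM call; here the summary is a placeholder so the windowing logic itself is visible.

```python
def compact_history(messages: list, keep_last: int = 4) -> list:
    """Keep the last N messages verbatim; collapse older ones into one summary."""
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    # Placeholder summary; a real system would summarize `older` with an LLM.
    summary = f"[summary of {len(older)} earlier messages]"
    return [summary] + recent
```

Run this compaction whenever the history crosses a token threshold, not on every turn, so recent context stays intact for the agents that need it.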
Beware of state conflicts when agents run concurrently. If two agents try to update the same state field simultaneously, you need a merge strategy. LangGraph handles this with reducer functions that define how concurrent updates combine, such as appending to a list or taking the latest value.
Implementing a Multi-Agent System with LangGraph
Let us implement a supervisor pattern using LangGraph. The system has a supervisor agent that routes to three specialist agents: a researcher, a writer, and a coder. The supervisor decides which agent to call based on the user's request and can call agents multiple times in different orders.
Start by defining the state with a messages field, a next_agent field for routing, and domain-specific fields for each agent's output. The supervisor node calls an LLM with instructions explaining the available agents and asks it to choose which agent should handle the current step. Use structured output to get a clean routing decision.
Each specialist agent is implemented as a subgraph or a node that calls an LLM with role-specific instructions and tools. The researcher has web search and document retrieval tools. The writer has a long-form generation prompt. The coder has a code generation prompt and a code execution tool for testing.
Conditional edges route from the supervisor to the chosen specialist and from each specialist back to the supervisor. The supervisor evaluates the specialist's output and either delegates to another specialist, asks the same specialist to refine its work, or produces a final response.
The termination condition is critical. Without it, the supervisor might loop indefinitely. Implement two safeguards: a maximum iteration count in the state that decrements with each agent call, and an explicit FINISH option in the supervisor's routing choices that the LLM can select when the task is complete.
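Both safeguards can live in the routing function. This sketch assumes the state tracks an `iterations` counter that each agent call increments; the `MAX_ITERATIONS` value is an arbitrary example.

```python
MAX_ITERATIONS = 8  # illustrative budget; tune per workload

def route_with_budget(state: dict) -> str:
    """Honour the LLM's FINISH choice, but also hard-stop on the budget."""
    if state["next_agent"] == "FINISH":
        return "__end__"              # the supervisor chose to terminate
    if state["iterations"] >= MAX_ITERATIONS:
        return "__end__"              # hard stop against endless loops
    return state["next_agent"]
```

The explicit FINISH option handles the happy path; the iteration budget is the backstop for the cases where the supervisor never chooses it.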
Add checkpointing with a PostgresSaver for production deployment. This allows the system to survive restarts, support human-in-the-loop patterns, and maintain conversation state across sessions. Each user conversation gets a unique thread_id, and the entire state is persisted after every node execution.
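LangGraph keys persisted checkpoints on the `thread_id` inside the `configurable` dict passed at invoke time. A small helper makes the per-conversation wiring explicit; the `user:session` naming scheme here is an assumption, not a library requirement.

```python
def thread_config(user_id: str, session_id: str) -> dict:
    """Build the per-conversation config that keys checkpointed state."""
    return {"configurable": {"thread_id": f"{user_id}:{session_id}"}}

# A compiled graph would then be invoked per conversation, roughly:
#   app = graph.compile(checkpointer=postgres_saver)
#   app.invoke({"messages": [...]}, config=thread_config("u42", "s1"))
```

Because every invocation with the same thread_id resumes from the last persisted checkpoint, restarts and human-in-the-loop pauses fall out of this one config convention.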
Test the system thoroughly by mocking LLM responses and verifying that routing logic works correctly for various scenarios. Test the happy path, error cases, maximum iteration limits, and edge cases like the user changing topics mid-conversation.
Debugging, Testing, and Scaling Multi-Agent Systems
Multi-agent systems are notoriously difficult to debug because the execution path is non-deterministic and depends on LLM decisions at each routing point. Invest heavily in observability from day one.
LangSmith is the primary debugging tool for LangGraph-based multi-agent systems. Every agent call, tool execution, and routing decision is captured in a hierarchical trace. When something goes wrong, you can see exactly which agent was active, what context it received, what it generated, and how the supervisor interpreted the result. Tag traces with user IDs and session IDs to correlate with user reports.
Unit testing focuses on individual agent nodes. Mock the LLM to return predetermined responses and verify that the node produces the correct state updates. Test the supervisor's routing logic by providing various conversation states and verifying it routes to the expected agent. Integration tests run the full graph with mocked LLMs to verify end-to-end behavior for key scenarios.
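A sketch of the injection approach: the supervisor takes the LLM as a parameter so a test can substitute a deterministic fake. This simplified signature is a stand-in for illustration, not the exact node shape a framework requires.

```python
def supervisor(state: dict, llm) -> dict:
    # In production, llm would be a structured-output LLM call.
    decision = llm(state["messages"])
    return {"next_agent": decision}

def test_routes_research_questions() -> bool:
    # Fake LLM: always routes to the researcher, no network call needed.
    fake_llm = lambda messages: "researcher"
    result = supervisor({"messages": ["find sources on topic X"]}, fake_llm)
    return result["next_agent"] == "researcher"
```

The same pattern scales to table-driven tests: one fake response per scenario, with an assertion on the resulting state update.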
Evaluation of multi-agent systems requires task-level metrics rather than turn-level metrics. Define success criteria for each task type: a research task is successful if it finds relevant sources, a writing task is successful if the output meets quality standards, and the overall system is successful if the final response addresses the user's request completely and accurately. Build an evaluation dataset with 50-100 examples covering diverse scenarios.
Scaling multi-agent systems introduces concurrency challenges. If two users are interacting with the system simultaneously, their state must be completely isolated. LangGraph's checkpointing handles this by keying state on thread_id. For horizontal scaling, deploy multiple instances behind a load balancer and use a shared database (PostgreSQL) for checkpointing so any instance can resume any thread.
Cost management at scale requires careful token budgeting. Track tokens per agent, per tool call, and per routing decision. Identify which agents consume the most tokens and optimize their prompts. Consider using cheaper models for routine agents (researcher, formatter) and reserving expensive models for critical agents (supervisor, quality reviewer).
Code Example
from typing import Annotated, Literal, TypedDict
import operator

from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph
from pydantic import BaseModel

class State(TypedDict):
    messages: Annotated[list, operator.add]  # concurrent updates append
    next_agent: str

class RouteSchema(BaseModel):
    """Structured routing decision returned by the supervisor LLM."""
    next: Literal["researcher", "writer", "coder", "FINISH"]

def supervisor(state: State):
    llm = ChatOpenAI(model="gpt-4o")
    response = llm.with_structured_output(RouteSchema).invoke(
        [{"role": "system", "content": "Route to: researcher, writer, coder, or FINISH"}]
        + state["messages"]
    )
    return {"next_agent": response.next}

def researcher_node(state: State):
    ...  # role-specific instructions plus search and retrieval tools

def writer_node(state: State):
    ...  # long-form generation prompt

def coder_node(state: State):
    ...  # code generation prompt plus execution tool

def route(state: State) -> Literal["researcher", "writer", "coder", "__end__"]:
    if state["next_agent"] == "FINISH":
        return "__end__"
    return state["next_agent"]

graph = StateGraph(State)
graph.add_node("supervisor", supervisor)
graph.add_node("researcher", researcher_node)
graph.add_node("writer", writer_node)
graph.add_node("coder", coder_node)
graph.set_entry_point("supervisor")
graph.add_conditional_edges("supervisor", route)
graph.add_edge("researcher", "supervisor")
graph.add_edge("writer", "supervisor")
graph.add_edge("coder", "supervisor")
app = graph.compile()