You picked up MCP because it promised standard tool access. You looked at A2A because you needed agents to delegate work to each other. You stood up a framework — AutoGen, CrewAI, LangGraph — and wired the pieces together. The demo worked. Then you tried to ship something real, and you hit the wall the entire stack is built around: it moves messages between agents. It does not help you decide whether to trust what arrived.
That gap — the verification gap at the handoff boundary — is where multi-agent systems live or die. It is also the gap no protocol or framework currently owns.
Here is the landscape as it actually stands. What the protocols solve, where they stop. What the frameworks do, what failure profiles they impose. What the research says about whether multi-agent is worth the complexity. And where the gaps are that you still have to fill yourself.
The Protocol Layer
Two protocols define the current stack. They come from different companies, solve different problems, and are explicitly designed to complement each other.
MCP (Model Context Protocol) is Anthropic's. It standardizes how an agent connects to external capabilities: databases, APIs, files, tools. The architecture is host-client-server: a host application holds clients that connect to servers providing resources, prompts, and executable tools. Messages are JSON-RPC 2.0, carried over stdio or HTTP. It was inspired by the Language Server Protocol, which solved the same integration problem for code editors.
What MCP solves: the N×M integration problem. Without it, every agent-framework pair needs custom code to talk to every tool. With it, any MCP-compatible server works with any MCP-compatible client. That is real.
Where MCP stops: the protocol's own specification is explicit. The security and trust section states that "MCP itself cannot enforce these security principles at the protocol level" and that tool descriptions "should be considered untrusted, unless obtained from a trusted server." No verification primitives. No audit trail at the protocol level. No trust decisions. You get a typed interface to capabilities. You do not get a way to know those capabilities are behaving correctly.
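To make that boundary concrete, here is a minimal sketch of what travels over MCP and what the host still has to decide for itself. The `tools/call` method and its `name` and `arguments` parameters come from the MCP specification. The `TRUSTED_TOOLS` allowlist, the tool name `query_orders`, and both helper functions are hypothetical: they illustrate one possible host-side policy, not something the protocol provides.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids should be unique within a session

# Hypothetical host-side policy: tools this host has reviewed and approved.
TRUSTED_TOOLS = {"query_orders"}


def build_tool_call(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def gated_tool_call(name: str, arguments: dict) -> dict:
    """Refuse to build a call for any tool the host has not approved."""
    if name not in TRUSTED_TOOLS:
        raise PermissionError(f"tool {name!r} is not on the host allowlist")
    return build_tool_call(name, arguments)


request = gated_tool_call("query_orders", {"customer_id": "c-123"})
print(json.dumps(request, indent=2))
```

Nothing in the request describes how the result will be checked. The allowlist lives entirely in host code, which is where the specification says such decisions belong.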
A2A (Agent-to-Agent Protocol) is Google's. It is now an open-source project under the Linux Foundation with an Apache 2.0 license. Where MCP connects an agent to tools, A2A connects agents to each other. One agent delegates work to another, tracks status, and receives results.
The core design choice is opacity. Agents collaborate without sharing internal state, memory, or tool implementations. Each agent is a black box to the others. They exchange structured JSON-RPC messages, discover each other through Agent Cards (self-description documents that advertise capabilities and connection info), and interact synchronously, via streaming, or through asynchronous push notifications.
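For a sense of what an Agent Card carries, here is an illustrative sketch. The field names follow my reading of the A2A specification and should be checked against the current version. The agent, URL, and skill are invented, and the `REQUIRED_FIELDS` list is my own choice for this example, not the spec's normative list.

```python
# Illustrative Agent Card. Field names are assumed from the A2A spec;
# the agent, URL, and skill below are invented for the example.
agent_card = {
    "name": "Invoice Reconciler",
    "description": "Matches invoices against purchase orders.",
    "url": "https://agents.example.com/reconciler",
    "version": "1.0.0",
    "capabilities": {"streaming": True, "pushNotifications": False},
    "skills": [
        {
            "id": "reconcile-invoice",
            "name": "Reconcile invoice",
            "description": "Compare one invoice with its purchase order.",
        }
    ],
}

# Hypothetical minimum a caller might insist on before delegating.
REQUIRED_FIELDS = ("name", "description", "url", "version", "capabilities", "skills")


def missing_card_fields(card: dict) -> list[str]:
    """Return the required fields that the card does not declare."""
    return [name for name in REQUIRED_FIELDS if name not in card]


print(missing_card_fields(agent_card))
```

The check only confirms that the card is well-formed. A card is self-description, so a complete card says nothing about whether the agent behind it does what it advertises.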
Opacity is a deliberate security decision. It is also a trust problem. If you cannot inspect the internal reasoning of the agent you handed work to, you have to trust its output without understanding its process.
The two protocols are explicitly complementary. A2A's own documentation frames it as "enabling agents to collaborate with each other" while MCP handles agent-to-tool access. Different problems, different protocols, same gap: neither specifies how to verify that a delegated task was completed correctly, or how to enforce trust at the handoff boundary.
The historical rhyme. FIPA, the Foundation for Intelligent Physical Agents, tried to standardize agent communication in the late 1990s. FIPA ACL defined message types based on speech-act theory: Inform, Request, Query, Propose. The Contract-Net Protocol handled task allocation through a bid-and-award mechanism. In 2005 FIPA was folded into the IEEE Computer Society as a standards committee. The standards never achieved broad commercial adoption outside academic research.
The lesson is three decades old: standardizing message formats is necessary but not sufficient. Trust, verification, and error recovery do not solve themselves. MCP and A2A are better engineered than FIPA was. They face the same wall.
My read: Adopt MCP and A2A for what they do — integration and transport. They reduce the custom-code tax meaningfully. Do not expect them to handle trust, verification, or error recovery, because they explicitly do not. If your architecture depends on the protocol layer for those problems, you have a gap you have not filled yet.
The Framework Layer
The protocols move messages. The frameworks decide the topology those messages flow through — and topology determines your failure profile more than any feature comparison.
AutoGen (Microsoft) pioneered the conversational multi-agent model: agents share a chat context and pass messages back and forth. It is flexible and natural. It is also hard to control, and as of the current README, AutoGen is in maintenance mode. Microsoft recommends its Agent Framework (MAF) as the production-ready successor, with graph-based orchestration, built-in OpenTelemetry, and native A2A and MCP interoperability. If you are evaluating AutoGen today, the forward path is MAF, not AutoGen.
The failure profile of conversational topologies is error propagation through shared context. A bad output enters the chat. Downstream agents process it as valid. The topology provides no quarantine mechanism. You get a conversation. You do not get a referee.
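One way to picture the missing referee is a gate in front of the shared context. The sketch below is framework-agnostic and entirely hypothetical: `GatedTranscript`, `Message`, and the `cites_a_source` rule are names I made up, and a real validator would be much richer than a substring test.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    sender: str
    content: str


@dataclass
class GatedTranscript:
    """Shared chat context that only admits messages passing a validator."""

    validator: Callable[[Message], bool]
    admitted: list[Message] = field(default_factory=list)
    quarantined: list[Message] = field(default_factory=list)

    def post(self, message: Message) -> bool:
        """Admit the message if it validates, otherwise set it aside."""
        if self.validator(message):
            self.admitted.append(message)
            return True
        self.quarantined.append(message)
        return False

    def context(self) -> str:
        """Text that downstream agents see: admitted messages only."""
        return "\n".join(f"{m.sender}: {m.content}" for m in self.admitted)


def cites_a_source(message: Message) -> bool:
    # Hypothetical rule: a claim must carry a source tag to enter the chat.
    return "[source:" in message.content


transcript = GatedTranscript(validator=cites_a_source)
transcript.post(Message("researcher", "Churn rose 4% in Q3 [source: billing_export.csv]"))
transcript.post(Message("researcher", "Churn probably doubled"))
print(transcript.context())
```

The unsourced claim lands in `quarantined` instead of the shared context, so downstream agents never treat it as valid input.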
CrewAI takes a different approach. Agents have explicit roles — researcher, analyst, writer — with defined tasks and expected outputs. The architecture splits into Crews (autonomous, role-based collaboration) and Flows (event-driven, production-oriented orchestration). CrewAI is built independently of LangChain and claims over 100,000 developers certified through community courses.
The failure profile of role-based hierarchies is node poisoning. If the researcher produces bad context, the analyst processes bad context. You get clean attribution — role labels tell you where the error originated — but the error still propagates downstream until you catch it manually.
LangGraph (LangChain) is graph-structured. Agents and tools are nodes in a directed graph. State is explicit, passed between nodes, and persists across cycles. The design draws from Pregel and Apache Beam — this is a state machine framework, not a chat framework. It supports durable execution that survives failures and resumes from checkpoints.
The failure profile of graph topologies is state inconsistency across nodes in cyclic or conditional flows. The graph structure makes debugging tractable — you can trace the execution path — but it does not prevent errors. LangGraph gives you a state machine. You still have to build the debugger.
My read: The framework landscape is more fluid than it looks. AutoGen entering maintenance mode mid-stream is itself evidence that this space is not settled. Choose based on the debugging model you can support, not the feature list. Conversational frameworks give you flexibility and messy error propagation. Hierarchical frameworks give you attribution and downstream poisoning. Graph frameworks give you explicit state and state-inconsistency debugging. None of them gives you verification for free.
What the Research Says
The research base has matured. Recent work has moved from anecdote to measurement, and the measurements are not encouraging for the "just add more agents" school of thought.
79% of multi-agent failures are structural. A 2025 study that reviewed over 1,600 annotated production traces across LangGraph, AutoGen, CrewAI, and OpenAI's Assistants API found that the overwhelming majority of failures are architectural, not model-level. Specification failures alone — ambiguous task definition, role underspecification, constraint underspecification, goal misalignment — account for 42% of all failures. The paper's language is blunt: none of them are addressable by a better model. They respond to engineering, not capability upgrades. [1]
Errors cascade like epidemics. Recent work models multi-agent error propagation using SIR epidemiological models — the same mathematics used for disease spread. The transmission threshold depends on the spectral radius of the dependency graph: cyclic architectures amplify errors, tree architectures do not. Five of six multi-agent frameworks evaluated reached 100% network infection from a single erroneous input. The one that did not was a strict tree topology where the spectral radius was 1.0. The paper proposes a BICR mitigation pattern — Buffer, Isolate, Challenge, Recover — that reduces cascade probability from 0.32 to 0.094, a 3.4× reduction. No framework ships it as a default. [2]
Delegation has reliability limits. A 2026 technical note models LLM-based multi-agent planning as a delegated decision network and shows a hard constraint: without new external signals, a delegated network is decision-theoretically dominated by a centralized Bayes decision maker with access to the same information. Communication and compression add loss. The practical read is simple: splitting a problem across agents does not create reliability by itself. It creates interfaces where information can be compressed, distorted, or dropped. [3]
Agents that pass testing fail at scale. An evaluation framework for enterprise agentic systems found that agents produce consistent results about 60% of the time without load, degrading to 25% at production scale. These agents pass standard accuracy metrics on any given run. They fail the predictability test: the same query should produce the same verdict. [4]
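The predictability test is easy to approximate in a harness. The sketch below measures how often repeated runs of the same query agree with the most common verdict. This is my own simplification, and the paper's metric may be defined differently. `flaky_agent` is a scripted stand-in for a real agent call.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def consistency_rate(agent: Callable[[str], str], query: str, runs: int = 10) -> float:
    """Fraction of runs that agree with the most common verdict."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    verdicts = [agent(query) for _ in range(runs)]
    _, top_count = Counter(verdicts).most_common(1)[0]
    return top_count / runs


# Scripted stand-in for a nondeterministic agent (hypothetical verdicts).
_scripted = cycle(["approve", "approve", "reject", "approve", "escalate"])


def flaky_agent(query: str) -> str:
    return next(_scripted)


rate = consistency_rate(flaky_agent, "Is invoice 1042 approved?", runs=10)
print(rate)  # 0.6
```

An agent like this can look fine on any single run and still give the majority answer only six times in ten.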
The faults you cannot reproduce in testing. A runtime fault taxonomy cataloged 37 fault types across open-source agentic repositories. The most striking finding: credential expiration and token-refresh failures predict authentication failure with a statistical lift of 181.5 — invisible at the agent-logic layer, visible only with distributed tracing correlating events across the full agent graph. These faults do not reproduce in test suites. They happen under load, under timing pressure, in the gaps between agent calls. [5]
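For readers who have not met the statistic, lift compares how often two events occur together with how often they would co-occur if they were independent: P(A and B) / (P(A) × P(B)). The sketch below computes it from trace counts. The counts are made up for illustration and are not the paper's data.

```python
def lift(n_total: int, n_a: int, n_b: int, n_both: int) -> float:
    """Lift of A and B: P(A and B) / (P(A) * P(B)), computed from counts."""
    if min(n_total, n_a, n_b) <= 0:
        raise ValueError("counts must be positive")
    # Algebraically the same as (n_both/n_total) / ((n_a/n_total) * (n_b/n_total)).
    return n_both * n_total / (n_a * n_b)


# Hypothetical trace counts: 10,000 traces, 40 with a token-refresh failure,
# 50 with an authentication failure, 36 with both.
value = lift(n_total=10_000, n_a=40, n_b=50, n_both=36)
print(value)  # 180.0
```

A lift of 1 means no association. A value in the hundreds means the two events almost never appear apart, which is why correlating them requires traces that span the whole agent graph.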
The single-agent baseline. METR's long-horizon task benchmarks provide context that makes the multi-agent findings more concerning, not less. The length of tasks AI agents complete with 50% reliability has been doubling roughly every seven months — an impressive trend. But the current horizon for frontier models sits at about one hour at 50% success, with less than 10% success on tasks requiring more than four hours. If a single agent already struggles to maintain coherent state over long sequences, composing multiple agents — each carrying that same limitation — compounds the problem rather than distributing it. [6]
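The compounding argument can be made concrete with a back-of-the-envelope model. The sketch assumes every step in a chain succeeds independently with the same probability. That assumption is mine, not METR's, and real failures are often correlated.

```python
def chain_success(step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be a probability")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


for steps in (1, 3, 5, 10):
    print(steps, round(chain_success(0.9, steps), 3))
```

Under this simple model, a chain of ten handoffs that are each 90% reliable completes cleanly only about a third of the time.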
My read: The research is unambiguous on one point. Adding agents adds error surface, not intelligence, unless you build verification in. The 79% structural failure finding means the problems are architectural, not model-level. Spending budget on a better model is the wrong fix. Spending it on specification clarity, verification gates, and observability is the right one.
Where the Gaps Actually Are
The protocols solve transport. The frameworks solve topology. The research says the value and the danger live in a layer neither owns.
Trust at handoff. When agent A delegates to agent B, how does A know B completed the work correctly? MCP says this is not a protocol-level concern. A2A makes it worse by design — opacity means you cannot inspect B's internal reasoning. No protocol provides verification primitives. You build them.
I run a multi-agent dispatch system myself, and this is the gap I spend the most time on. The stack gives me structured handoffs between agents — task requests, status tracking, response formatting. What it does not give me is a way to know that the agent on the receiving end actually did the work correctly before its output enters the downstream pipeline. I had to build verification gates, audit logging, and shape validation myself because nothing in the protocol or framework layer provided them. That is not a complaint about the tools. It is an observation about where the layer boundary sits.
A concrete version: Agent A asks Agent B to update a customer onboarding flow. Agent B returns a clean success message: files changed, tests passed, summary attached. The handoff looks complete. But the actual change updated the UI copy and missed the validation schema, so production now accepts a field the backend rejects. Nothing in MCP catches that. A2A can carry the response, status, and artifact references, but it cannot tell you whether “onboarding flow updated” meant the same thing to both agents. The missing control is a handoff contract: what claim is being made, what evidence supports it, what independent check validates it, and what happens if the check fails.
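The four parts of that contract map directly onto a small data structure. This is a minimal sketch, not a standard: `HandoffContract`, `accept_handoff`, and the `schema_covers_ui_fields` check are hypothetical names, and the evidence fields are invented to mirror the onboarding example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HandoffContract:
    claim: str                      # what the delegate says it did
    evidence: dict                  # artifacts offered in support of the claim
    check: Callable[[dict], bool]   # independent validation run by the receiver
    on_failure: str                 # what happens if the check fails


def accept_handoff(contract: HandoffContract) -> str:
    """Return 'accepted' only if the independent check passes."""
    try:
        ok = contract.check(contract.evidence)
    except Exception:
        ok = False  # a check that crashes is treated as a failed check
    return "accepted" if ok else contract.on_failure


def schema_covers_ui_fields(evidence: dict) -> bool:
    # Hypothetical check: every field the UI collects must also appear
    # in the validation schema.
    return set(evidence["ui_fields"]) <= set(evidence["schema_fields"])


contract = HandoffContract(
    claim="onboarding flow updated",
    evidence={
        "ui_fields": ["email", "company", "referral_code"],
        "schema_fields": ["email", "company"],
    },
    check=schema_covers_ui_fields,
    on_failure="reject",
)
print(accept_handoff(contract))  # reject
```

Agent B's success message never enters the decision. Acceptance depends only on a check that the receiving side chose and ran.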
Observability across boundaries. When a multi-agent system breaks, the fault taxonomy shows the root cause is often invisible at the agent-logic layer. The credential-cascade finding — lift of 181.5 — is the canonical example: the failure manifests as an authentication error, but the root cause is a token-refresh timing issue several layers removed. Frameworks have begun shipping tracing tools — MAF includes OpenTelemetry, LangGraph has LangSmith, CrewAI has a Crew Control Plane — but these are framework-specific silos, not cross-framework standards. Observability across a mixed-framework agent graph is still something you wire together yourself.
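The wiring usually starts with one shared identifier that survives every handoff. The sketch below is a hand-rolled illustration with hypothetical names (`Envelope`, `start_trace`, `delegate`). In practice you would more likely carry this context with OpenTelemetry or W3C Trace Context headers than build it yourself.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Envelope:
    """Handoff wrapper that carries trace context alongside the task."""

    task: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_span_id: Optional[str] = None


def start_trace(task: str) -> Envelope:
    """Create the root envelope with a fresh trace id."""
    return Envelope(task=task, trace_id=uuid.uuid4().hex)


def delegate(parent: Envelope, task: str) -> Envelope:
    """Create a child envelope that keeps the trace id and links to its parent."""
    return Envelope(task=task, trace_id=parent.trace_id, parent_span_id=parent.span_id)


root = start_trace("onboard customer")
child = delegate(root, "refresh API token")
grandchild = delegate(child, "call billing service")
print(grandchild.trace_id == root.trace_id)  # True
```

With the same `trace_id` on every log line, an authentication error in the last agent can be joined back to the token refresh that preceded it, whichever framework each agent runs on.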
Error recovery versus amplification. The standard multi-agent response to an agent failure is to spawn another agent to handle it. The cascade research says this works when failures are independent and the task decomposes cleanly. It fails when failures are contextual — when the error lives in shared assumptions, not in execution. In that case, spawning a recovery agent adds a node to an already-infected graph. An analysis from Zartis's multi-agent research identifies the "semantic circuit breaker" — catching HTTP 200 responses that contain wrong content — as the most important unbuilt tool in this space. No framework provides it.
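As a rough illustration of the idea, here is what a semantic circuit breaker could look like. This is my own sketch of the concept, not Zartis's design or any framework's API. The `content_ok` predicate is a placeholder for whatever domain check fits your outputs, and the breaker has no reset or half-open state.

```python
from typing import Callable


class SemanticCircuitBreaker:
    """Opens after consecutive responses that fail a content check,
    even when the transport layer reports HTTP 200."""

    def __init__(self, content_ok: Callable[[str], bool], max_failures: int = 3):
        self.content_ok = content_ok
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def admit(self, status: int, body: str) -> bool:
        """Return True if the response may flow downstream."""
        if self.is_open:
            return False  # stop feeding the pipeline until someone intervenes
        if status == 200 and self.content_ok(body):
            self.failures = 0
            return True
        self.failures += 1
        return False


# Hypothetical content rule: a valid answer must contain a total.
breaker = SemanticCircuitBreaker(content_ok=lambda body: "total:" in body, max_failures=2)
responses = ["total: 41.50", "I could not find the invoice", "Sorry, try again", "total: 12.00"]
results = [breaker.admit(200, body) for body in responses]
print(results)  # [True, False, False, False]
```

The last response is well-formed but still blocked, because the breaker has already opened. That is the point: once content failures repeat, the system stops trusting the channel instead of spawning another agent into it.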
No current protocol or framework provides these as primitives. You construct them yourself, or you accept the failure modes of the systems you chose.
The Bottom Line
If you are evaluating multi-agent architectures right now, here is the honest picture.
The protocols are real and worth adopting. MCP standardizes tool access. A2A standardizes agent-to-agent task delegation. They reduce integration costs. They do not provide trust enforcement — and they are honest about that in their documentation.
The frameworks are usable but unsettled. AutoGen is in maintenance mode; MAF is its successor. CrewAI and LangGraph are actively developed. They impose different topologies that produce fundamentally different failure profiles. Choose the debugging model you can support.
The research is clear. 79% of multi-agent failures are structural, not model-level. Errors cascade through cyclic architectures: five of six frameworks reached total infection from a single error. Sequential chains degrade exponentially past three agents. Agents that pass testing lose consistency at scale. Adding agents without verification adds error surface, not intelligence.
The gap — the one no protocol or framework owns — is the operational layer: verification at handoff, observability across boundaries, and recovery that does not amplify the original fault.
So the practical question is not “Which agents can talk to each other?” That part is getting standardized. The question is: when an agent hands you work, what makes it safe to believe?
Build that layer, or inherit the failure modes of everyone underneath you.
Sources:
[1] "Why Do Multi-Agent LLM Systems Fail?", arXiv:2503.13657
[2] "From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration", arXiv:2603.04474
[3] "On the Reliability Limits of LLM-Based Multi-Agent Planning", arXiv:2603.26993
[4] "Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems", arXiv:2511.14136
[5] "Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes", arXiv:2603.06847
[6] METR, "Measuring AI Ability to Complete Long Tasks", March 2025