Multi-Agent Hallucinations Are a Context Problem, Not a Model Problem

When an agent chain produces wrong output, the instinct is to blame the LLM. But multi-agent hallucinations almost always originate in broken context handoffs — not model capability. Here's the failure taxonomy and the engineering fixes.

Meritshot · 13 min read
Tags: AI Agents, LLM, Multi-Agent Systems, Data Science, Machine Learning

You tested Agent A in isolation. It retrieved the right data. You tested Agent B in isolation. It summarised accurately. You tested Agent C in isolation. It wrote clean, factual output.

Then you chained them together and the final answer was confidently, completely wrong.

You spent the next two days adjusting prompts. You switched to a more capable model. You added more detailed instructions. Nothing worked — because you were fixing the wrong thing.

The hallucination was not generated by any single agent. It was assembled, piece by piece, through a broken context handoff. The model was never the problem. The plumbing between models was.


Why Multi-Agent Hallucination Is Misdiagnosed Almost Every Time

The instinct when an agent chain produces wrong output is to blame the LLM. It is the most visible component. It produced the bad text. It feels like the culprit.

But in a multi-agent system, the LLM is only as good as the context it receives. When Agent B hallucinates, the first question should not be "which model should I use instead?" It should be "what did Agent B actually receive as input, and how was that input constructed?"

In most debugging sessions this question never gets asked — because developers do not log inter-agent context. They log inputs and outputs at the edges of the system, not the handoffs in the middle. The corruption is invisible precisely where it matters most.
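
A minimal sketch of what logging the middle of the chain could look like. The `log_handoff` helper, its field names, and the JSON Lines file are illustrative assumptions, not part of any particular agent framework:

```python
import hashlib
import json
import time
from pathlib import Path


def log_handoff(log_path: Path, session_id: str, hop: int,
                sender: str, receiver: str, payload: dict) -> str:
    """Append the exact context passed from one agent to the next.

    Returns a short content hash so identical payloads can be compared
    across runs. All names here are hypothetical.
    """
    serialised = json.dumps(payload, sort_keys=True, default=str)
    digest = hashlib.sha256(serialised.encode("utf-8")).hexdigest()[:12]
    record = {
        "ts": time.time(),
        "session_id": session_id,
        "hop": hop,
        "sender": sender,
        "receiver": receiver,
        "payload_sha256_12": digest,
        "payload": payload,
    }
    # One JSON object per line keeps the log easy to grep and replay
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, default=str) + "\n")
    return digest
```

With a record like this at every hop, "what did Agent B actually receive?" becomes a lookup rather than a guess.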

This is the core misdiagnosis pattern: treating a context assembly failure as a model capability failure.

The pattern compounds when teams escalate to larger, more expensive models. GPT-3.5 hallucinates, so they switch to GPT-4. GPT-4 hallucinates more confidently on the same broken input, so they try Claude. Claude produces a beautifully written, well-structured, completely wrong answer. The model got better. The output got worse.

What makes this particularly dangerous at the enterprise level is that more capable models are better at sounding authoritative. A weaker model producing wrong output at least sounds uncertain. A frontier model receiving corrupted context produces output that reads like it was written by a subject matter expert — which means it travels further through review processes before anyone catches it.


The Telephone Game at Enterprise Scale

In 2023, a fintech company built a three-agent research pipeline to automate equity reports. Agent 1 retrieved financial data from an internal database. Agent 2 analysed the data and produced a structured summary. Agent 3 wrote the final narrative report.

The system looked clean. Each agent had a clear job. The prompts were well-crafted. The model was GPT-4.

Six weeks after deployment, a portfolio manager flagged that a report had cited a revenue figure that was 40% higher than the actual number. Nobody had changed the model. Nobody had changed the prompts.

What had changed: the financial database had been updated with a new schema. Agent 1 was now retrieving data correctly, but its output format had shifted slightly — a field that Agent 2 expected as a number was now occasionally returned as a formatted string with a currency symbol. Agent 2's analysis prompt had no validation logic. It parsed the string, misread the magnitude, and passed a corrupted number forward as a clean integer. Agent 3 received a plausible-looking number, had no way to know it was wrong, and wrote a confident report around it.

What went right: Each agent's individual logic was sound.

What went wrong: There was no schema contract between agents. Agent 2 trusted Agent 1's output unconditionally.

Why it matters: In financial services, a 40% revenue misstatement is not a UX bug. It is a regulatory event.
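
To make the mechanism concrete, here is a small sketch of how a forgiving parser can turn a formatted string into a plausible but wrong number. It is not a reconstruction of the incident above; `lenient_parse` and `strict_parse` are hypothetical helpers written for illustration:

```python
import re
from decimal import Decimal


def lenient_parse(value) -> float:
    """The kind of forgiving parser that hides a format change."""
    if isinstance(value, (int, float)):
        return float(value)
    # Drops "$", ",", "M", "K" and anything else that is not a digit or dot
    digits = re.sub(r"[^0-9.]", "", str(value))
    return float(digits)


def strict_parse(value) -> Decimal:
    """Accept only plain numbers; reject anything that needs interpretation."""
    if isinstance(value, bool):
        raise TypeError("bool is not a revenue figure")
    if isinstance(value, (int, float, Decimal)):
        return Decimal(str(value))
    raise TypeError(f"expected a number, got {type(value).__name__}: {value!r}")


wrong = lenient_parse("$4.2M")  # 4.2: the "M" is silently discarded
```

The lenient version never fails, which is exactly the problem: the downstream agent receives a clean-looking number with the wrong magnitude. The strict version turns the schema change into an immediate, visible error.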

The same pattern plays out in legal tech (citation truncation), healthcare (symptom omission from clinical notes), and e-commerce (VAT ambiguity in pricing). The common thread across all of these: the failure point is a trust assumption made silently at a handoff boundary.


What Context Actually Contains — And Why That Makes It Hard to Validate

Before you can fix inter-agent context, you need a precise understanding of what it contains. Most developers think of context as "the previous agent's output." In practice it is far messier.

By the time a message reaches Agent 3 in a production pipeline, the context window typically contains:

  • The system prompt defining the agent's role and constraints
  • The original user query — often paraphrased or summarised by this point
  • Tool call results from external APIs, databases, or search engines
  • Prior agent outputs — sometimes in full, sometimes summarised
  • Intermediate reasoning steps if chain-of-thought prompting is active
  • Error messages from failed tool calls, which the agent may attempt to reason around
  • Injected metadata like timestamps, session IDs, or routing instructions

Every one of these components can carry corruption. Tool call results can be stale. Prior agent summaries can be lossy. Error messages from failed calls can be misinterpreted as factual content. Intermediate reasoning from an earlier agent can anchor a downstream agent to a wrong assumption before it even begins its task.

The context window is not a clean input. It is a palimpsest — layered, partially overwritten, carrying traces of every prior decision in the chain. Treating it as a reliable single source of truth is the architectural mistake that causes most multi-agent hallucinations.
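
One way to make those layers visible is to tag every piece of context with where it came from. The sketch below is an assumption about how that could be modelled; the `Source` categories mirror the list above, and the exclusion policy in `factual_items` is illustrative only:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Source(Enum):
    SYSTEM_PROMPT = "system_prompt"
    USER_QUERY = "user_query"
    TOOL_RESULT = "tool_result"
    AGENT_OUTPUT = "agent_output"
    REASONING = "reasoning"
    ERROR = "error"
    METADATA = "metadata"


@dataclass(frozen=True)
class ContextItem:
    source: Source
    content: str
    produced_by: str  # agent or tool name
    produced_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    summarised: bool = False  # True if lossy compression was applied


def factual_items(items: list[ContextItem]) -> list[ContextItem]:
    """Keep only items that may be quoted as evidence downstream.

    Errors and intermediate reasoning are excluded so they cannot be
    mistaken for facts. The policy itself is an illustrative assumption.
    """
    excluded = {Source.ERROR, Source.REASONING}
    return [item for item in items if item.source not in excluded]
```

Once provenance travels with the content, validation rules can be written per source type instead of against one undifferentiated blob of text.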


The Five Context Failure Modes That Cause Most Multi-Agent Hallucinations

These are not edge cases. Expect a production multi-agent system to hit at least two of them within its first three months of deployment.

Failure Mode 1 — Context Collapse

Agent A produces rich, nuanced output. The system summarises it before passing it to Agent B to keep the context window manageable. The summarisation loses a critical qualifier — "revenue increased 12% year-over-year excluding acquisitions" becomes "revenue increased 12%." Agent B builds analysis on the stripped version. The nuance is gone and never comes back.

  • Pro of summarisation: Keeps token costs manageable in long chains
  • Con: Lossy compression of structured data is irreversible once passed downstream
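
A cheap guard against this failure is to check that qualifying language survives the summarisation step. The following is a rough sketch under the assumption that a keyword list is good enough as a first tripwire; the patterns are examples, not a vetted vocabulary:

```python
import re

# Illustrative list; a real system would tune this to its own domain.
QUALIFIER_PATTERNS = [
    r"\bexcluding\b", r"\bexcept\b", r"\badjusted\b", r"\bpro forma\b",
    r"\bapproximately\b", r"\bestimated\b", r"\bunaudited\b",
]


def dropped_qualifiers(original: str, summary: str) -> list[str]:
    """Return qualifier words present in the original but absent from the summary."""
    dropped = []
    for pattern in QUALIFIER_PATTERNS:
        found = re.search(pattern, original, flags=re.IGNORECASE)
        if found and not re.search(pattern, summary, flags=re.IGNORECASE):
            dropped.append(found.group(0).lower())
    return dropped
```

A non-empty result does not prove the summary is wrong, but it is a reasonable trigger for passing the original text forward instead of the compressed version.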

Failure Mode 2 — Context Contamination

Agent B retrieves external data using a tool call. The tool returns a result that includes stale information — a pricing page cached from six months ago, a policy document that has since been updated. Agent B has no mechanism to distinguish fresh data from stale data. It treats all retrieved content with equal confidence.

What makes contamination particularly insidious is that the stale information is often partially correct. A pricing document from six months ago has the right product names and roughly right price ranges. Agent B incorporates it confidently. The error is small enough to survive review but large enough to matter when a customer receives a quote 8% below current pricing.

  • Pro of tool-augmented agents: Dramatically extends the system's knowledge beyond its training cutoff
  • Con: Without freshness metadata and validation, tools become hallucination injection points

Failure Mode 3 — Context Bloat

As the chain deepens, each agent appends its full output to the running conversation history. By Agent 4, the context window contains the complete output of Agents 1, 2, and 3 — much of which is now irrelevant to Agent 4's specific task. The signal-to-noise ratio collapses.

Research on long-context retrieval consistently shows that models are less reliable at using information from the middle of a long context than from the beginning or end, an effect known as the "lost in the middle" problem.

  • Pro of passing full history: Preserves all prior reasoning for downstream agents
  • Con: Recall is not uniform across an 8,000-token context, and facts buried in the middle are the most likely to be missed

Failure Mode 4 — Role Label Injection

The system constructs inter-agent messages using role labels like "role": "assistant" to pass one agent's output to the next. An attacker — or even a malformed external document retrieved by a tool — can embed text that mimics these role boundaries. The model cannot distinguish the authentic system structure from the injected instruction.

In 2024, security researchers demonstrated that documents retrieved by RAG pipelines could carry embedded role-boundary strings that hijacked downstream agent behaviour. Role labels are not a trust boundary. They are a convention.
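
A minimal sketch of one mitigation: scan retrieved text for strings that look like role boundaries and carry it as labelled data rather than splicing it into the message list. The patterns below are illustrative and far from exhaustive, and detection of this kind is a tripwire, not a complete defence:

```python
import re

# Illustrative patterns only; chat formats vary by provider.
ROLE_MARKERS = re.compile(
    r'("role"\s*:\s*"(system|assistant|user)")'
    r"|(^\s*(system|assistant|user)\s*:)"
    r"|(<\|im_start\|>)",
    flags=re.IGNORECASE | re.MULTILINE,
)


def contains_role_markers(text: str) -> bool:
    """True if the text contains something resembling a role boundary."""
    return ROLE_MARKERS.search(text) is not None


def wrap_untrusted(text: str, source: str) -> dict:
    """Carry retrieved text in a dedicated field, flagged if it looks suspicious."""
    return {
        "type": "untrusted_document",
        "source": source,
        "suspicious": contains_role_markers(text),
        "content": text,
    }
```

Legitimate documents can trip the second pattern (a transcript with "User:" lines, for example), so a flag is better treated as a reason for review than as an automatic block.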

Failure Mode 5 — Anchoring Propagation

Agent A makes a judgment call early in the chain — a classification, a risk assessment, a confidence score. Downstream agents receive this judgment as part of their context. Even when the downstream agent's task has nothing to do with that judgment, it influences the output through anchoring.

This is a statistical bias embedded in the architecture. It does not produce obvious errors. It produces outputs that are systematically skewed in the direction of whatever Agent A decided.


The Fix: Treating Inter-Agent Context as a First-Class Engineering Artifact

The engineers who build reliable multi-agent systems do not have better models. They have better handoff discipline. Context passed between agents should be treated with the same rigour as data passed between microservices — typed, validated, versioned, and scoped.

Structured Handoff Schemas

Every agent output that becomes another agent's input should conform to an explicit schema enforced in code before the next agent receives it:

from datetime import datetime
from typing import Optional

from pydantic import BaseModel, field_validator  # Pydantic v2; on v1 use `validator`


class AnalysisAgentOutput(BaseModel):
    company_name: str
    revenue_current: float  # must parse as a number; "$4.2M"-style strings fail validation
    revenue_previous: float
    growth_rate: float  # fraction, e.g. 0.12 for 12%
    data_source: str
    retrieved_at: datetime
    confidence_score: float
    exclusions: Optional[str] = None  # preserves qualifiers like "excluding acquisitions"

    @field_validator('revenue_current', 'revenue_previous')
    @classmethod
    def must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError('Revenue values must be positive')
        return v

    @field_validator('growth_rate')
    @classmethod
    def flag_extreme_growth(cls, v: float) -> float:
        # Halt the chain on >50% growth in either direction so a human can verify it
        if abs(v) > 0.5:
            raise ValueError(f'Extreme growth rate {v:.1%} requires human verification')
        return v

If an upstream agent's output fails this schema, the chain stops before the next agent sees it. The exclusions field is critical: it preserves the qualifiers that a generic summariser would strip.
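
"The chain stops" can be made explicit with a small gate at each boundary. This is a sketch, assuming `validate` is any callable that returns a validated object or raises, such as a Pydantic v2 model's `model_validate` method; `guarded_handoff` and `HandoffRejected` are hypothetical names:

```python
from typing import Any, Callable


class HandoffRejected(Exception):
    """Raised when an agent's output fails validation at a handoff boundary."""


def guarded_handoff(raw_output: Any, validate: Callable[[Any], Any],
                    sender: str, receiver: str) -> Any:
    """Validate `raw_output` before `receiver` ever sees it."""
    try:
        return validate(raw_output)
    except (ValueError, TypeError) as exc:
        # Name the boundary in the error so the failing hop is obvious in logs
        raise HandoffRejected(
            f"{sender} -> {receiver}: output rejected ({exc})"
        ) from exc
```

Raising a dedicated exception that names both agents means the failure is attributed to the handoff, not to whichever model happened to run next.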

Least-Privilege Context Passing

Agent B does not need Agent A's full output. It needs the specific fields relevant to its task:

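def build_agent_b_context(agent_a_output: AnalysisAgentOutput) -> dict:
    # Forward only the fields Agent B's task needs; everything else stays behind
    return {
        "revenue_growth": agent_a_output.growth_rate,
        "comparison_period": "YoY",
        "exclusions_note": agent_a_output.exclusions,
        # Age in whole days; assumes retrieved_at is a naive local-time datetime
        "data_freshness_days": (datetime.now() - agent_a_output.retrieved_at).days
    }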

This limits context bloat and reduces anchoring propagation. Agent A's confidence score is deliberately excluded because it should not influence Agent B's independent reasoning.
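
The same idea can be expressed as a generic allowlist, so that what each agent is permitted to see lives in one reviewable place. A minimal sketch; `project_context`, `CONTEXT_ALLOWLIST`, and the agent names are assumptions made for illustration:

```python
def project_context(agent_output: dict, allowed_fields: set[str]) -> dict:
    """Return only the allowlisted fields; fail loudly if one is missing."""
    missing = allowed_fields - set(agent_output)
    if missing:
        raise KeyError(f"upstream output is missing required fields: {sorted(missing)}")
    return {name: agent_output[name] for name in sorted(allowed_fields)}


# Hypothetical per-agent allowlists, kept together so they can be reviewed.
CONTEXT_ALLOWLIST = {
    "narrative_agent": {"growth_rate", "exclusions"},
    "risk_agent": {"growth_rate", "revenue_current", "revenue_previous"},
}
```

Adding a field to a downstream agent's context then becomes a deliberate change to the allowlist rather than a side effect of whatever the upstream agent happened to emit.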

Freshness Gates on Tool Output

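Every tool result should carry a retrieval timestamp, and each content type gets a maximum acceptable age that is enforced before the data enters an agent's context:

from datetime import datetime


class StaleDataError(Exception):
    """Raised when retrieved data is older than its content type allows."""


# Maximum acceptable age in days, per content type
FRESHNESS_THRESHOLDS = {
    "pricing": 1,          # pricing data must be less than 1 day old
    "regulatory": 7,       # regulatory docs acceptable up to 7 days
    "market_data": 0.25,   # market data must be less than 6 hours old
    "company_profile": 30,
}

def freshness_gate(retrieved_data: dict, content_type: str) -> bool:
    max_age_days = FRESHNESS_THRESHOLDS.get(content_type, 7)  # unknown types default to 7 days
    # Assumes a naive ISO 8601 timestamp; a timezone-aware one cannot be compared with datetime.now()
    retrieved_at = datetime.fromisoformat(retrieved_data['retrieved_at'])
    age_days = (datetime.now() - retrieved_at).total_seconds() / 86400

    if age_days > max_age_days:
        raise StaleDataError(
            f"Retrieved {content_type} data is {age_days:.1f} days old. "
            f"Maximum allowed: {max_age_days} days."
        )
    return True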

Isolation Boundaries for Independent Agents

When agents should reason independently, they should receive identical, clean inputs without visibility into each other's intermediate outputs. This prevents anchoring propagation:

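# risk_agent, opportunity_agent, synthesise_agent, and build_isolated_context
# are assumed to be defined elsewhere in the pipeline.
def run_parallel_assessment(data: dict) -> dict:
    clean_input = build_isolated_context(data)

    # Neither agent sees the other's output, so these calls could run concurrently
    risk_output = risk_agent.run(clean_input)
    opportunity_output = opportunity_agent.run(clean_input)  # same clean_input

    # Only the synthesis step sees both assessments
    return synthesise_agent.run({
        "risk_assessment": risk_output,
        "opportunity_assessment": opportunity_output,
        "synthesis_task": "Reconcile and produce balanced recommendation"
    })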
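
The function above calls its agents one after the other and hands both the same object. If the agents can mutate their input, or if you want them to run concurrently, one possible variation is to give each agent its own deep copy and run them in a thread pool. This is a sketch: `run_isolated` and the `Agent` callable type are stand-ins for whatever agent interface you actually use.

```python
import copy
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Agent = Callable[[dict], dict]  # stand-in for a real agent interface


def run_isolated(agents: dict[str, Agent], clean_input: dict) -> dict[str, dict]:
    """Run each agent on its own deep copy of the same input, concurrently.

    No agent can see or mutate another agent's input or output.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        futures = {
            # The copy is made before submission, so isolation does not
            # depend on thread scheduling.
            name: pool.submit(agent, copy.deepcopy(clean_input))
            for name, agent in agents.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

The results dictionary can then be passed to a synthesis agent, which is the only component that sees every assessment.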

Memory Architecture: The Hidden Context Problem

Most multi-agent discussions focus on within-chain context. The harder problem is cross-session memory — what an agent system carries forward from one conversation to the next.

When a multi-agent system has persistent memory, it can accumulate contamination over time. An error that entered the memory store in session 3 can influence sessions 4 through 40 before anyone notices. Three designs trade off learning against that risk:

Short-term working memory only: Each session starts completely fresh. Zero contamination risk, but cannot learn from prior interactions.

Validated episodic memory: Selected outputs from prior sessions are stored with explicit schema validation and timestamps. Before any stored memory enters a new session's context, it passes through freshness and integrity checks.

Semantic memory with decay: Generalised knowledge extracted from prior sessions is stored as embeddings in a vector store, with relevance-weighted retrieval and time-based decay. Old memories are not deleted — they are downweighted.
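
Time-based decay can be as simple as multiplying the retrieval similarity by an exponential factor. A minimal sketch, assuming a half-life formulation; the function name and the 30-day default are arbitrary choices for illustration:

```python
from datetime import datetime, timedelta, timezone


def decayed_score(similarity: float, stored_at: datetime, now: datetime,
                  half_life_days: float = 30.0) -> float:
    """Weight a retrieval similarity by exponential time decay.

    A memory that is exactly one half-life old keeps half its weight.
    """
    age_days = max((now - stored_at).total_seconds() / 86400, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)
```

Under this weighting, a highly similar memory from three half-lives ago keeps only one eighth of its weight, so it ranks below a moderately similar memory stored today. Nothing is deleted, but stale entries stop dominating retrieval.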


Building an Evaluation Harness for Agent Chains

Fixing the handoff is necessary but not sufficient. You also need per-hop evaluation — not just evaluation at the terminal output.

# llm_judge is assumed to return a numeric score for how well the output fulfils the task
def evaluate_agent_chain(chain_log: list[dict]) -> dict:
    if not chain_log:
        raise ValueError("chain_log is empty; nothing to evaluate")

    hop_scores = []

    for hop in chain_log:
        score = llm_judge(
            task=hop['agent_task_definition'],
            input=hop['input'],
            output=hop['output']
        )
        hop_scores.append({
            "agent": hop['agent_name'],
            "score": score,
            "schema_valid": hop['schema_validation_passed'],
            "context_tokens": hop['input_token_count'],
        })

    # Flag bloat when the last hop's input is more than 3x the first hop's
    token_trend = [h['context_tokens'] for h in hop_scores]
    bloat_warning = token_trend[-1] > token_trend[0] * 3

    return {
        "chain_quality": min(h['score'] for h in hop_scores),
        "weakest_hop": min(hop_scores, key=lambda x: x['score'])['agent'],
        "context_bloat_detected": bloat_warning,
        "schema_failures": [h['agent'] for h in hop_scores if not h['schema_valid']],
        "hop_detail": hop_scores
    }

The chain quality score is the minimum hop score, not the average. A chain that scores 95%, 94%, 41%, 96% has a chain quality of 41%. The average masks the failure. The minimum exposes it.
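
The arithmetic for that example, written out as a short illustration:

```python
from statistics import mean

hop_scores = [0.95, 0.94, 0.41, 0.96]

average_quality = mean(hop_scores)  # about 0.815, which looks acceptable
chain_quality = min(hop_scores)     # 0.41, which exposes the failing hop
weakest_hop_index = hop_scores.index(chain_quality)  # 2, i.e. the third agent
```

An average above 80% would pass most dashboards. The minimum points straight at the hop that needs attention.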


The Observability Stack a Production Multi-Agent System Actually Needs

A production-grade observability stack for multi-agent systems has four layers:

Layer 1 — Structured trace logging: Every agent invocation is logged with session ID, hop number, agent name, input token count, output token count, schema validation result, tool calls made, freshness gate results, latency, and model version.

Layer 2 — Per-hop quality metrics: LLM-as-judge evaluation running asynchronously on every hop output, scored against the agent's task definition. Schema validation pass rates tracked as time series, because degradation here is an early warning signal that typically shows up before output quality degrades.

Layer 3 — Chain-level anomaly detection: Automated detection of context bloat, schema failure clusters, and output divergence from historical distribution.

Layer 4 — Human review sampling: Systematic random sampling of 1–2% of full chain traces for human review, weighted toward chains that triggered anomaly detection alerts. This is the ground truth calibration layer.
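
As one concrete piece of Layers 2 and 3, the schema pass rate can be tracked over a rolling window per agent. This is a sketch only; the `PassRateMonitor` class, its window size, and its alert threshold are assumptions chosen for illustration:

```python
from collections import deque


class PassRateMonitor:
    """Rolling schema-validation pass rate for one agent."""

    def __init__(self, window: int = 200, alert_below: float = 0.98):
        self.results = deque(maxlen=window)  # oldest results fall off automatically
        self.alert_below = alert_below

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    @property
    def pass_rate(self) -> float:
        # An empty window is treated as healthy so a new agent does not alert
        return sum(self.results) / len(self.results) if self.results else 1.0

    def should_alert(self) -> bool:
        # Wait for a full window so a single early failure does not page anyone
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.pass_rate < self.alert_below
```

Chains that pass through an agent whose monitor is alerting are natural candidates for the weighted human review sample in Layer 4.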


Understanding why multi-agent systems hallucinate is the entry point. The harder questions are what separate production-ready systems from demos. How do you design a memory architecture that persists validated state across sessions? How do you build an orchestration layer that routes tasks between agents dynamically while maintaining context integrity? How does the least-privilege context principle change when you move from three agents to fifteen?

These questions do not have clean textbook answers. They have tradeoffs, failure patterns, and architectural decisions that only become clear when you have built something and watched it break in a real environment.

At Meritshot's Data Science with Agentic AI program, multi-agent architecture is taught through production case studies — systems built by real teams, the exact context failure that caused them to behave incorrectly, and the engineering decisions that resolved it.
