The Data Pipeline You Trust Is the First Thing Agents Break

Your data pipeline has been running cleanly for eighteen months. Then you wire an AI agent into it. The pipeline didn't break — the agent exposed what was always there. Here's the complete failure taxonomy and what you need to engineer differently.

Meritshot | 16 min read
Tags: Data Engineering, AI Agents, Data Pipelines, Data Science, Production AI

The Pipeline Worked Fine. Until the Agent Showed Up.

Your data pipeline has been running cleanly for eighteen months. SLA adherence above 99%. Schema validation passing. Downstream dashboards refreshing on schedule. Your data engineering team is proud of it — and they should be.

Then you wire an AI agent into it.

Within two weeks, something nobody can immediately explain starts happening: the agent is producing outputs that are subtly, consistently wrong. Not catastrophically wrong — wrong in the way that only becomes visible when a domain expert squints at the numbers. The pipeline metrics look identical to what they were before the agent arrived. No failed jobs. No schema violations in the alerting system. No data quality checks tripped.

The pipeline didn't break. The agent exposed what was always there.

This is the core dynamic that most teams discover the hard way: a data pipeline designed and validated for human-facing analytics is not the same thing as a data pipeline that reliably feeds an autonomous agent. The assumptions baked into the first design — tolerances, latency windows, schema flexibilities, null handling conventions — become active failure vectors in the second context. The agent doesn't adapt to pipeline quirks the way a human analyst does. It operationalizes them.


Why Pipelines Built for Analysts Fail Agents

Human analysts have an enormous capacity for tacit data interpretation. When a dashboard shows a revenue figure that seems off for a Monday morning, a senior analyst thinks: "Oh, that's probably the weekend lag — our payment processor batches Saturday and Sunday transactions together and they hit the warehouse Monday at 6 AM. The number will normalize by noon."

That reasoning happens implicitly, instantly, and invisibly. The analyst never documents it. Nobody builds a data quality check for it.

The agent has no such knowledge. It reads the Monday morning revenue figure, treats it as ground truth, and acts on it. If it's an inventory planning agent, it may trigger a procurement decision based on apparently depressed weekend revenue. If it's a customer segmentation agent, it may downgrade high-value customers who appear to have gone quiet when they're simply in the batch-processing gap.

The three categories of pipeline assumptions that agents break:

  • Temporal assumptions — data that reflects a point-in-time state that is not current, treated by the agent as present reality
  • Semantic assumptions — fields whose meaning has evolved over time but whose schema hasn't, where historical and current records use the same field to mean different things
  • Completeness assumptions — pipelines that surface partial data during load windows, where a human analyst knows to wait but an agent reads and acts on the partial state

None of these are bugs in the traditional sense. They're design decisions made for a context — human analysts with institutional knowledge — that the agent doesn't share.


Schema Drift: The Silent Corruption

Schema drift is well-understood in data engineering. Fields get renamed. Types change. Nullable columns become required. Every mature pipeline has mechanisms for handling this — schema registries, contract testing, backward compatibility policies.

The problem is that these mechanisms were designed to prevent downstream breakage in known consumers. Reports break. Dashboards break. These failures are visible and recoverable.

When an agent is the downstream consumer, schema drift breaks differently.

Real scenario: A retail operations agent monitors inventory across 340 SKUs and triggers reorder requests when stock levels drop below thresholds. It reads from a warehouse management system that has been running for six years.

Four months ago, the warehouse team quietly introduced a sellable_units field to replace available_units, to better reflect that some units are reserved for returns processing. The old available_units field was kept in the schema for backward compatibility, but it now reports total physical units, not sellable units.

The agent was configured to read available_units. Nobody updated the agent configuration when the field semantics changed, because nobody thought of the agent as a downstream consumer that needed to be in the schema change review process.

The agent is now reading inflated inventory numbers: total physical units, including reserve stock. Its reorder thresholds trigger late or not at all. Stockouts begin appearing three weeks later.

Why this is harder to catch than it looks:

The available_units field is still present. It's still returning numbers. The agent isn't receiving nulls or type errors — it's receiving plausible, reasonable-looking numbers that happen to mean something different. No schema validation check catches semantic drift. The contract test passes because the field exists and is the right type.

What actually works:

  • Semantic versioning for pipeline fields, not just schemas. When a field's meaning changes — even if its name and type are preserved — that's a semantic breaking change that should trigger a version bump and consumer notification.
  • Agent consumer registration in schema change review. Every agent that reads from a pipeline should be listed as a named downstream consumer in the schema registry.
  • Lineage tagging on agent reads. Every field the agent accesses should be logged with a lineage tag — not just "agent read from pipeline X" but "agent read field available_units from warehouse_inventory at version 1.4.2 at 14:32 UTC."
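
The three practices above can be sketched in a few lines. This is a minimal illustration, not a real schema registry API: FieldContract, check_field, and lineage_tag are hypothetical names, and it assumes the registry stores a "MAJOR.MINOR.PATCH" semantic version per field, where the major number is bumped whenever the field's meaning changes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FieldContract:
    """Hypothetical registry entry: what a field means at a given version."""
    table: str
    field_name: str
    semantic_version: str  # assumed "MAJOR.MINOR.PATCH"; MAJOR bumps when meaning changes
    meaning: str


class SemanticDriftError(RuntimeError):
    """Raised when a field's meaning changed since the agent was configured."""


def major(version: str) -> int:
    """Extract the major component of a "MAJOR.MINOR.PATCH" string."""
    return int(version.split(".")[0])


def check_field(pinned: FieldContract, current: FieldContract) -> None:
    """Refuse to read a field whose major semantic version has moved."""
    if major(current.semantic_version) != major(pinned.semantic_version):
        raise SemanticDriftError(
            f"{current.table}.{current.field_name} changed meaning: "
            f"{pinned.meaning!r} -> {current.meaning!r}"
        )


def lineage_tag(agent: str, contract: FieldContract, read_at: datetime) -> str:
    """Build a field-level lineage line for the agent's read log."""
    utc = read_at.astimezone(timezone.utc)
    return (
        f"{agent} read field {contract.field_name} from {contract.table} "
        f"at version {contract.semantic_version} at {utc:%H:%M} UTC"
    )
```

In the inventory scenario, the agent would pin the contract it was configured against and call check_field before every read. The change in what available_units means would then surface as an error on the first read after the change, rather than as stockouts three weeks later.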

Data Freshness: When "Current" Isn't

Every pipeline has a freshness characteristic — a lag between when something happens in the real world and when it's reflected in the data the agent reads. For a human analyst running a weekly report, a six-hour data lag is irrelevant. For an agent making decisions in response to real-time conditions, a six-hour lag can be the difference between a correct action and a deeply wrong one.

Real scenario: A customer success platform deploys an agent to monitor health scores across enterprise accounts and proactively trigger intervention workflows when scores drop. The agent reads from a data warehouse that aggregates CRM activity, product usage telemetry, and support ticket data.

Unknown to the agent: the product usage telemetry pipeline has an ingestion delay that varies between 2 and 18 hours depending on the telemetry source. A customer who has been actively using the product all day may appear to have zero product activity in the warehouse until their telemetry batch processes at midnight.

The agent, reading at 4 PM, sees an account with no product usage for the day. It classifies this as a churn risk signal and triggers an automated intervention email: "We noticed you haven't been using [Product] recently — is everything okay?"

The customer, who has been using the product actively all day, receives this email and is confused and mildly insulted.

The freshness problem has three distinct dimensions:

| Dimension | What It Means | Agent Failure Mode |
| --- | --- | --- |
| Known lag | Pipeline has a documented, predictable delay | Agent acts on stale data it could have been warned about |
| Variable lag | Pipeline delay varies based on source behaviour or batch timing | Agent has no way to predict or detect staleness |
| Silent failure lag | Pipeline appears to be running but source system stopped sending data | Agent treats absence of data as meaningful signal |

What actually works:

Every dataset the agent reads should carry a data_as_of timestamp — not the pipeline run timestamp, but the timestamp of the most recent event reflected in that dataset. The agent should be explicitly designed to reason about this timestamp: Is the data fresh enough to act on? What's the maximum acceptable lag for this decision type?

This requires two things most pipelines don't currently provide:

  1. Event-time tracking — the pipeline must track and surface the maximum event time in the current batch, not just the wall clock time of the pipeline run
  2. Freshness SLAs per dataset — documented, machine-readable thresholds that the agent can reference to evaluate whether the data it's reading is within acceptable freshness bounds
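
A freshness gate built on those two pieces might look like the sketch below. The dataset names, the SLA values, and the DatasetMetadata and decide names are all assumptions made for illustration. The point is only that the check compares against data_as_of (event time), never against the pipeline run time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical machine-readable freshness SLAs: maximum acceptable lag per dataset.
FRESHNESS_SLA = {
    "product_usage": timedelta(hours=1),
    "crm_activity": timedelta(hours=6),
}


@dataclass(frozen=True)
class DatasetMetadata:
    name: str
    data_as_of: datetime       # latest event time reflected in the data
    pipeline_run_at: datetime  # wall-clock time of the load; not a staleness signal


def data_lag(meta: DatasetMetadata, now: datetime) -> timedelta:
    """Lag between the newest event in the dataset and the moment of the read."""
    return now - meta.data_as_of


def decide(meta: DatasetMetadata, now: datetime) -> str:
    """Return "act" when the data is within its SLA, otherwise "defer"."""
    if data_lag(meta, now) <= FRESHNESS_SLA[meta.name]:
        return "act"
    return "defer"
```

Applied to the customer success scenario: a read at 4 PM against telemetry whose newest event is from midnight yields a 16-hour lag, so the agent defers instead of sending the intervention email, even if the pipeline job itself finished five minutes earlier.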

Null Handling: The Convention That Became a Landmine

Every data engineer has strong opinions about nulls. The problem is that those opinions vary by team, by table, by era of the pipeline's construction — and the conventions are almost never written down anywhere an agent can read them.

In a mature data warehouse, you will commonly find:

  • Tables where null means "unknown" (no data was ever collected)
  • Tables where null means "not applicable" (the field doesn't apply to this record type)
  • Tables where null means "zero" (a deliberate convention adopted to save storage in sparse tables)
  • Tables where null means "the ETL job failed for this record" (a bug that was never fixed, now preserved for historical consistency)

Real scenario: A marketing automation agent reads from a customer data platform to identify high-potential leads for outbound campaigns. One of its signals is last_purchase_value. In the customer table, null in this field means the customer has never made a purchase.

The agent was designed to filter for customers with last_purchase_value > 1000. Because it's filtering for values greater than a threshold, null records are automatically excluded. This seems correct.

But there's a second table — the campaign response table — where last_purchase_value was populated differently during a legacy migration. In that table, null means the field wasn't migrated for records before 2019, not that the customer hasn't purchased. Some of these pre-2019 customers are actually the highest-value accounts in the database. The agent never sees them.

The campaign targets medium-value customers and misses the highest-value segment entirely. Nobody notices until the campaign ROI report is reviewed three months later.

What actually works:

  • Null semantics documentation as a machine-readable schema annotation — not a comment in a SQL file, but a structured metadata field in the schema registry that the agent can read and reason about before consuming the field
  • Explicit null handling logic per field in agent tool definitions — when an agent tool reads a field that can be null, the tool definition should specify how that null is to be interpreted for this specific field
  • Cross-table null convention audits before agent onboarding — before connecting any agent to a multi-table data environment, run a null convention audit across every field the agent will read
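
As a rough sketch of the first two points, null semantics can be stored as structured annotations and consulted by the agent's tool before it filters. The NULL_SEMANTICS mapping, the table names, and the exceeds_threshold helper are hypothetical. Classifying "never purchased" as "not applicable" is also an assumption made for this example.

```python
from enum import Enum
from typing import Optional


class NullMeaning(Enum):
    UNKNOWN = "unknown"                # no data was ever collected or migrated
    NOT_APPLICABLE = "not_applicable"  # the field does not apply to this record
    ZERO = "zero"                      # null stands in for 0
    LOAD_FAILED = "load_failed"        # the ETL job failed for this record


# Hypothetical schema-registry annotations keyed by (table, field).
NULL_SEMANTICS = {
    ("customer", "last_purchase_value"): NullMeaning.NOT_APPLICABLE,
    ("campaign_response", "last_purchase_value"): NullMeaning.UNKNOWN,
}


def exceeds_threshold(table: str, field: str, value: Optional[float],
                      threshold: float) -> Optional[bool]:
    """Return True/False when the answer is known, None when a null hides it."""
    if value is not None:
        return value > threshold
    meaning = NULL_SEMANTICS[(table, field)]
    if meaning is NullMeaning.ZERO:
        return 0 > threshold
    if meaning is NullMeaning.NOT_APPLICABLE:
        return False  # e.g. the customer has never made a purchase
    return None  # UNKNOWN or LOAD_FAILED: do not silently exclude the record
```

The three-valued return is the design choice that matters here. A plain `value > 1000` filter collapses "no purchase" and "not migrated" into the same exclusion, whereas a None result gives the agent a reason to route pre-2019 records to a lookup or a human review instead of dropping them.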

Duplicate Records: The Error That Compounds in Agent Loops

Duplicate data in a warehouse is a known, managed problem for analysts. They know to deduplicate before aggregating. Agents don't make that adjustment. Worse, they compound the error.

When an agent reads a dataset with duplicates and uses that data to make a decision, the decision is based on inflated or repeated signals. When that decision triggers an action that feeds back into the pipeline — which is the case in any agentic system with tool use — duplicates don't just affect one decision. They propagate.

Real scenario: A logistics agent monitors freight shipment status and sends alerts when shipments are at risk of delay. It reads from an event streaming pipeline that aggregates status updates from three carrier APIs. One carrier sends duplicate status events for the same shipment — a known quirk that the data engineering team handles downstream via a deduplication step.

But the agent reads from the raw event stream, upstream of the deduplication step. The same shipment status event appears twice. The agent processes it twice, generating two delay alerts for the same shipment. The operations team receives duplicate alerts and begins dispatching two separate responses to the same shipment issue.

Two teams spend time on the same incident. The operations manager reviews the alert volume at end of month and flags an apparent spike in freight delays, when in reality the number of delayed shipments was unchanged and the alert count was inflated by upstream duplicates.

What actually works:

  • Agent reads should happen downstream of the deduplication step by default. Agents are frequently connected to raw data streams for latency reasons, so the latency cost of deduplication should be weighed explicitly against the correctness cost of skipping it
  • Idempotency keys in agent tool outputs — every action the agent takes should carry an idempotency key derived from the source record identifier
  • Alert deduplication at the delivery layer — for agent-generated notifications, implement deduplication at the delivery layer using a rolling time window
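
The last two points can be combined in a small sketch. It assumes a status event can be identified by carrier, shipment ID, status, and event time. Those field choices, along with the idempotency_key and AlertDeduplicator names, are illustrative rather than taken from any particular carrier API.

```python
import hashlib
from datetime import datetime, timedelta


def idempotency_key(carrier: str, shipment_id: str, status: str,
                    event_time: datetime) -> str:
    """Derive a stable key from the source record, so a replayed event maps to the same key."""
    raw = "|".join([carrier, shipment_id, status, event_time.isoformat()])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class AlertDeduplicator:
    """Suppress alerts whose key was already delivered within a rolling window."""

    def __init__(self, window: timedelta):
        self.window = window
        self._last_sent = {}  # key -> time the alert was delivered

    def should_send(self, key: str, now: datetime) -> bool:
        # Forget deliveries that have aged out of the window.
        cutoff = now - self.window
        self._last_sent = {k: t for k, t in self._last_sent.items() if t > cutoff}
        if key in self._last_sent:
            return False
        self._last_sent[key] = now
        return True
```

With this in place, the carrier's duplicate status event hashes to the same key as the original, so the second delay alert is dropped at delivery even when the agent reads the raw stream. An in-memory dictionary is only enough for a single process. A shared store would be needed if several agent workers deliver alerts.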

The Aggregation Timing Problem: When Your Metrics Are Snapshots, Not Streams

Most warehouse data isn't raw events — it's aggregated metrics. Daily revenue. Weekly active users. Monthly churn rate. These aggregations are computed at a specific point in time and written to the warehouse. They're snapshots.

For a human analyst running a monthly business review, snapshots are perfectly appropriate. For an agent making decisions throughout the day, a snapshot taken at midnight may be meaningfully wrong by 3 PM.

Real scenario: A pricing agent for an e-commerce platform adjusts product prices based on competitive positioning and demand signals. One of its inputs is a daily snapshot of competitor prices, scraped and aggregated overnight and written to the warehouse at 2 AM.

A competitor runs a flash sale starting at 11 AM. By noon, their prices on ten overlapping SKUs are 22% lower than what the pricing agent sees in its 2 AM snapshot. The agent, unaware of the price change, maintains the platform's existing prices. Conversion rate on those ten SKUs drops sharply through the afternoon.

The agent isn't broken. It's doing exactly what it was designed to do. But a once-a-day snapshot is the wrong kind of data for an agent making pricing decisions throughout the day.

| Data Type | Appropriate for | Problematic when fed to |
| --- | --- | --- |
| Daily snapshot | Morning business review, weekly reporting | Agents making intra-day decisions |
| Weekly aggregate | Strategic planning, trend analysis | Agents reacting to current-state signals |
| Batch-computed metric | Post-hoc analysis, historical comparison | Real-time agents requiring event-driven data |

What actually works:

Before wiring an agent to any data source, classify the data's temporal granularity: not just its refresh frequency, but the actual event resolution it represents. Then ask: does the agent's decision cadence match the data's event resolution? If the agent makes decisions every hour and the data updates once per day, either the data needs to become more granular, or the agent needs to be limited to deciding no more often than the data changes.
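
The cadence check itself is simple enough to encode as a pre-deployment assertion. The function names below are hypothetical, and the sketch assumes both quantities can be expressed as fixed intervals.

```python
from datetime import timedelta


def cadence_is_supported(decision_interval: timedelta,
                         event_resolution: timedelta) -> bool:
    """True when the data changes at least as often as the agent decides."""
    return event_resolution <= decision_interval


def bounded_decision_interval(decision_interval: timedelta,
                              event_resolution: timedelta) -> timedelta:
    """Slow the agent down to the data's resolution when the data is coarser."""
    return max(decision_interval, event_resolution)
```

For the pricing agent, an hourly decision interval against a daily competitor snapshot fails the check, and the bounded interval comes out as one day. That result is a prompt to either source intra-day competitor prices or stop treating the snapshot as a live signal.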


Volume Shocks: When the Agent Becomes the Surge

Here's a failure mode that data engineers understand instantly and AI engineers discover with surprise: agents generate query load patterns that look nothing like any prior consumer of the pipeline.

A human analyst runs a query. The query returns results. The analyst reads them. Maybe they run another query in ten minutes. The interaction pattern is slow, human-paced, and easily absorbed by standard warehouse capacity.

An agent runs a query. The query returns results. The agent reasons about them. The agent decides it needs more data — so it runs another query. And another. Across hundreds of concurrent sessions, an agent that makes four to eight data calls per session can generate query load that exceeds the warehouse's designed capacity within hours of launch.

Real scenario: A financial advisory agent helps retail investors review their portfolio against market conditions. Each session involves eight warehouse queries. At launch, the product team projected 500 daily active users.

On launch day, due to a successful press mention, 4,200 users arrived. The agent ran 8 queries per session. The warehouse received 33,600 queries in a four-hour window — against a system designed for roughly 4,000 queries per day from human analysts. The warehouse began throttling. Query latency climbed from 400ms to 14 seconds. Sessions began failing mid-flow.

The specific engineering responses that matter:

  • Query result caching at the agent layer. If 4,200 users all ask about the S&P 500 index level in the same four-hour window, the warehouse should answer that question once per cache lifetime, not 4,200 times.
  • Read replicas specifically designated for agent traffic. Agent query load should be isolated from analyst and application query load.
  • Per-session query budgets with circuit breakers. Define a maximum number of warehouse queries any single agent session can make.
  • Load testing at machine-speed patterns, not human-speed patterns. Before launch, run load tests that simulate agent query patterns — multiple queries per second, concurrent sessions, no human pacing.
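
Two of these responses, result caching and per-session budgets, fit in a short sketch. TTLCache, SessionBudget, and QueryBudgetExceeded are hypothetical names. The budget here is a plain hard stop standing in for a fuller circuit breaker, and the cache is per-process only.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Serve repeated identical queries from memory for ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no warehouse query
        value = fetch()  # the only place the warehouse is touched
        self._store[key] = (now, value)
        return value


class QueryBudgetExceeded(RuntimeError):
    """Raised when a session tries to exceed its warehouse query allowance."""


class SessionBudget:
    """Cap the number of warehouse queries a single agent session may issue."""

    def __init__(self, max_queries: int):
        self.max_queries = max_queries
        self.used = 0

    def spend(self) -> None:
        if self.used >= self.max_queries:
            raise QueryBudgetExceeded(f"session exceeded {self.max_queries} queries")
        self.used += 1
```

The session calls spend() before each uncached warehouse query. Whether cache hits should also count against the budget is a design choice. Not counting them rewards cacheable query shapes, which is usually the behaviour you want from an agent.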

Pipeline Observability for Agent Workloads: A Different Set of Questions

Standard pipeline monitoring answers: Did the job run? Did it complete on time? Did it produce the expected row count?

For agent-fed pipelines, those questions are necessary but not sufficient. The additional questions that matter:

Did the data the agent read reflect the state of the world the agent was asked to reason about? This is a freshness and event-time question, not a job completion question.

Did the agent read data from a consistent point in time across all the tables it joined? In a warehouse where different tables refresh at different frequencies, a join may combine data from two different moments in time.

Did the null conventions, semantic field definitions, and deduplication state the agent encountered match the agent's configured expectations? This requires lineage tracking at the field level, not just at the table level.

The observability additions that agent pipelines require:

  • Event-time watermarks surfaced as queryable metadata on every dataset
  • Cross-table snapshot consistency checks — verification that all tables used in a given agent session reflect data from the same or compatible time windows
  • Agent query pattern monitoring — separate dashboards tracking agent-generated query volume, latency, and cache hit rates
  • Field-level lineage logs tracking which agent read which field from which table version at which time
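
The cross-table consistency check can be approximated by comparing event-time watermarks across the tables a session touches. The sketch below assumes each table exposes one watermark timestamp. The table names and the acceptable skew are illustrative.

```python
from datetime import datetime, timedelta
from typing import Dict


def snapshot_skew(watermarks: Dict[str, datetime]) -> timedelta:
    """Spread between the freshest and the stalest table in a session."""
    times = list(watermarks.values())
    return max(times) - min(times)


def is_consistent(watermarks: Dict[str, datetime], max_skew: timedelta) -> bool:
    """True when all tables reflect the world within max_skew of each other."""
    return snapshot_skew(watermarks) <= max_skew


def stalest_table(watermarks: Dict[str, datetime]) -> str:
    """Name the table holding the join back, for the observability log."""
    return min(watermarks, key=watermarks.get)
```

Logging the skew and the stalest table per agent session turns "the join mixed two moments in time" from a post-mortem finding into a metric that can be alerted on.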

The Engineering Mindset Shift: Pipelines as Agent Infrastructure

The fundamental change that production AI workloads require is a reclassification of the data pipeline — from a reporting asset to an operational infrastructure component.

Reporting assets can tolerate known imperfections. Analysts adapt. Operational infrastructure components must be explicitly correct, because the systems running on top of them take automated actions without adaptation.

This reclassification changes every design decision downstream:

  • Freshness SLAs become hard constraints, not guidelines
  • Null conventions become machine-readable metadata, not institutional knowledge
  • Schema changes require agent impact assessment, not just BI impact assessment
  • Query capacity planning includes machine-speed load profiles, not just human usage patterns
  • Duplicate handling moves upstream of agent reads, not downstream in reporting logic

The data engineers who make this transition successfully aren't doing different technical work — they're applying the same rigour to a different class of consumer. What changes is the assumption that there's a human in the loop who will catch what the pipeline misses.

There isn't. And that changes everything.


At Meritshot, the Data Science program is built around exactly this convergence of data engineering and AI engineering. The pipeline failures described in this article aren't presented as warning slides — they're reproduced in live lab environments, diagnosed with real observability tooling, and resolved using the engineering patterns that actually hold in production. Mentors who teach these modules have run data pipelines that fed production ML systems and agent workloads, watched them break in the ways described here, and built the fixes from scratch.
