Why 95% of AI Pilots Fail to Deliver P&L Impact

8 min read · Updated 2026-05-02

MIT NANDA's "GenAI Divide" research found that 95% of AI pilots delivered zero measurable P&L impact. Not "some," not "minimal"—zero. The organizations that failed didn't lack good technology or smart people. They lacked something far simpler: a measurement infrastructure that tied AI cost to business outcome. This article breaks down the five reasons pilots fail and what the 5% who succeed do differently.

The MIT NANDA Finding

In 2024–2025, MIT NANDA interviewed 1,420 executives at organizations running AI pilots. The study asked a straightforward question: Did your pilot deliver measurable profit-and-loss impact?

The answer was blunt: 95% said no. They deployed an AI agent, ran it for three months, and when asked "did it make money or save money," they couldn't point to a concrete number.

This isn't unique to a single industry, company size, or AI use case. The pattern holds across healthcare, insurance, financial services, contact centers, and professional services. The common thread: measurement blindness.

McKinsey's State of AI 2025 showed the same pattern at scale. While 88% of large organizations use AI in at least one function, only 39% see EBIT impact, and only 5.5% qualify as "high performers" who can repeatedly create value from AI. The gap isn't between companies that bought expensive AI and companies that bought cheap AI. It's between companies that measure AI outcomes and companies that don't.

Root Cause #1: No Baseline Measurement

The most common failure mode is simple: the team never measured cost-per-outcome before deploying the agent.

Here's how it usually goes: An operations leader says, "We're doing 500 customer support tickets per month with a team of 3 people. Let's deploy an AI agent to help." The team builds the agent, deploys it, and runs it for 60 days. Then the CFO asks: "Did it work?"

And the operations leader realizes: "I actually don't know what it cost to resolve a ticket before we had AI."

Without a baseline, you can't calculate ROI. You can't even define "worked." If your baseline was $22 per ticket and AI brought it to $16, that's a win. If your baseline was $12 and AI brought it to $18, that's a loss. But if you never measured baseline, you're just guessing.

The 5% always start with a cost-per-outcome baseline. They spend 2-4 weeks interviewing operations, pulling P&L line items, and calculating: "Today, it costs us X dollars to resolve a ticket / adjudicate a claim / process a loan." Only then do they deploy the agent. Then they measure again. The math becomes obvious.
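
To make the baseline concrete, here is a minimal sketch of the calculation, assuming hypothetical figures (team size, loaded cost per person, share of time spent on tickets); none of these numbers come from the research cited above.

    # Hypothetical baseline: what does one ticket cost before any AI is involved?
    monthly_tickets = 500          # volume from the example above
    team_size = 3                  # people working those tickets
    loaded_cost_per_agent = 5_000  # assumed fully loaded monthly cost per person
    ticket_share_of_time = 0.60    # assumed share of their time spent on tickets
    tooling_overhead = 1_000       # assumed monthly ticketing/telephony/tool spend

    baseline = (team_size * loaded_cost_per_agent * ticket_share_of_time
                + tooling_overhead) / monthly_tickets
    print(f"Baseline: ${baseline:.2f} per ticket")  # -> Baseline: $20.00 per ticket

That baseline is the control case every post-deployment measurement gets compared against.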

Root Cause #2: Attribution Gap

Even when a pilot clearly reduces headcount or time spent, the financial benefit doesn't flow through the P&L. Here's why.

The AI agent's cost sits on one line: an "AI Infrastructure" or "Technology" line item. The human headcount reduction sits on another line: "Customer Support" or "Claims Processing." Finance sees them as unrelated.

So teams try to bridge the gap manually. In the best case, the team presents a 30-page memo with three scenarios and a sensitivity table, and the CFO nods politely and files it away. In the worst case, the memo lands on the CFO's desk just as the CFO is weighing a 10% budget cut across all departments, and the AI savings get absorbed into the cut rather than retained as a win.

The problem is attribution. The AI cost is attributed to technology infrastructure. The outcome (the resolved ticket, the adjudicated claim) is attributed to a specific P&L line. But there's no bridge between them. The system can't automatically connect "this AI agent cost $3.50 to resolve this ticket" to "this ticket belongs in customer support P&L."

The 5% solve this with an integrated measurement system. They don't separate "what did we spend on AI?" from "what did we accomplish?" They tie cost to outcome at the work-item level. The infrastructure answers: "That ticket was handled by the AI agent, which cost $3.50 on infrastructure and $2.40 in human review time, for a total of $5.90. The baseline would have been $20. Net savings: $14.10."
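
If you wanted to prototype that kind of work-item-level attribution, a minimal sketch might look like the following. The class, field names, and reviewer rate are hypothetical illustrations, not a description of any particular finance integration.

    from dataclasses import dataclass

    @dataclass
    class WorkItemCost:
        item_id: str
        handled_by: str                # e.g. "ai_agent" or "human"
        infra_cost: float              # model calls, retries, retrieval, tool calls
        review_minutes: float          # human review time spent on this item
        review_rate_per_min: float = 0.80  # assumed loaded reviewer cost per minute

        @property
        def total_cost(self) -> float:
            return self.infra_cost + self.review_minutes * self.review_rate_per_min

        def net_savings(self, baseline_cost: float) -> float:
            return baseline_cost - self.total_cost

    ticket = WorkItemCost(item_id="TCK-1042", handled_by="ai_agent",
                          infra_cost=3.50, review_minutes=3.0)
    print(f"Total cost:  ${ticket.total_cost:.2f}")          # -> $5.90
    print(f"Net savings: ${ticket.net_savings(20.00):.2f}")  # -> $14.10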

When the CFO sees that granularity, the ROI becomes undeniable.

Root Cause #3: Hidden Cost Blindness

A team launches a pilot and celebrates: "Our AI agent handles tickets for just $0.02 of API spend each. That's cheap!"

Three months later, the pilot's "cheap" cost has doubled or tripled, and no one knows why.

The problem is the AI Cost Iceberg. The visible cost—the OpenAI or Anthropic invoice—is about 10% of the true cost. The hidden costs include:

  • Inference overhead at scale. A single API call at $0.003 is cheap. But at 50,000 requests per day, that's roughly $4,500/month. And when volume spikes, you add rate-limit infrastructure ($500/month), caching systems ($1,500/month), and monitoring ($800/month).
  • Retries on failure. When an AI model hallucinates or errors, the system retries. Some teams don't even measure retry rates and are unknowingly paying 2x the API cost they think they're paying.
  • Human-in-the-loop review. Most production agents require humans to review, approve, or correct decisions. That's not an edge case; it is the cost structure. If your agent reaches 88% accuracy and 30% of cases require human review, your effective API cost per fully automated resolution is API cost ÷ (1 - escalation rate) = API cost ÷ 0.7 ≈ 1.43x the visible cost, before you even count the reviewers' time.
  • Vector databases and embedding storage. If you're using RAG (retrieval-augmented generation) to give the agent context, you're paying for vector database storage, embedding inference, and retrieval compute. These are often invisible line items.
  • Tool calls to third-party services. If your agent calls Stripe, Twilio, Salesforce, or an internal API to take action, each call is metered. Thousands of small calls add up.
  • Observability and monitoring. Logging, tracing, and monitoring an AI pipeline at scale requires infrastructure that's separate from the AI cost itself.

Most pilot teams measure only the visible cost and get blindsided when hidden costs emerge at scale.

The 5% build a cost model that includes all nine layers of the AI Cost Iceberg before they deploy. They forecast: "At scale, we expect 25% retries, 15% human review overhead, $500/month in observability, and $2,000/month in vector database cost." When real usage data arrives, it usually lands close to the forecast. If it doesn't, they investigate and adjust quickly.
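
As a sketch of what such a pre-deployment cost model might look like, the example below uses the forecast rates quoted above plus a few hypothetical figures (monthly volume, visible API cost per item, cost of one human review); it illustrates the approach, not the framework's actual model.

    # Hypothetical full-iceberg forecast for one month of production traffic.
    monthly_items = 10_000
    api_cost_per_item = 0.45     # assumed visible model spend per resolved item
    retry_rate = 0.25            # forecast share of calls that get retried
    escalation_rate = 0.15       # forecast share of items needing human review
    review_cost_per_item = 6.00  # assumed loaded cost of one human review
    observability = 500          # fixed monthly logging/tracing/monitoring spend
    vector_db = 2_000            # fixed monthly embedding + vector storage spend

    model_spend = monthly_items * api_cost_per_item * (1 + retry_rate)
    review_spend = monthly_items * escalation_rate * review_cost_per_item
    total = model_spend + review_spend + observability + vector_db

    print(f"Visible API spend: ${monthly_items * api_cost_per_item:,.0f}")  # $4,500
    print(f"Full-iceberg cost: ${total:,.0f}")                              # $17,125
    print(f"Cost per outcome:  ${total / monthly_items:.2f}")               # $1.71

The gap between the first number and the second is the part of the iceberg most pilot teams never model.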

Root Cause #4: Sandbox-to-Production Gap

A pilot runs in a controlled environment with curated data, low volume, and high user attention. Then it moves to production, where it meets reality.

The agent that achieved 98% accuracy on 100 sample tickets achieves 81% accuracy on 5,000 production tickets. The human review rate that was 5% in the pilot becomes 35% in production. The cost per ticket, which looked great in the forecast, suddenly looks bad.

What changed? Scale. Edge cases. User behavior outside the training set. Real data noise.

Most teams see this gap and don't have a good mental model for it. They blame the model, or the vendor, or bad luck. They didn't plan for it because they didn't measure pilot performance against a rigorous standard.

The 5% test at scale within the pilot phase. They don't run 100 test cases and call it proven. They run the agent on a statistically representative sample (maybe 10% of production volume) for 4-6 weeks, measure accuracy and cost at that scale, and build the forecast from reality, not theory. When they move to 100% production volume, they already know what to expect.
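
One way to turn that representative sample into a forecast is a simple extrapolation like the sketch below. The pilot figures are hypothetical; the point is only that the projection is built from metrics measured at pilot scale rather than from a curated demo set.

    # Hypothetical pilot run on roughly 10% of production volume for several weeks.
    pilot_items = 2_000          # work items processed during the pilot window
    pilot_escalations = 560      # items that needed human review
    pilot_total_cost = 9_400.00  # full-iceberg spend during the pilot window
    production_volume = 20_000   # expected monthly items at full rollout

    cost_per_item = pilot_total_cost / pilot_items
    escalation_rate = pilot_escalations / pilot_items

    print(f"Measured cost per item:   ${cost_per_item:.2f}")    # $4.70
    print(f"Measured escalation rate: {escalation_rate:.0%}")   # 28%
    print(f"Forecast monthly cost:    ${cost_per_item * production_volume:,.0f}")  # $94,000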

Root Cause #5: No Feedback Loop

The pilot ends. The agent goes into production. The team moves to the next project. And cost-per-outcome is never measured again.

This is the most insidious failure mode because it's invisible. Six months after deployment, the model has drifted, the human review rate has crept up from 15% to 28%, the inference latency has increased (because the vector database hasn't been optimized), and the cost per outcome is now $16 instead of the forecast $8.

But no one knows. There's no dashboard. There's no weekly report. There's no feedback loop that says "your agent's economics changed."

The 5% treat AI agents like they treat employees. They measure performance continuously. Weekly dashboards show: cost per outcome, accuracy, escalation rate, latency, human review time. If any metric drifts 10% or more, an alert fires. If cost per outcome is trending up, the team investigates. If accuracy is declining, they retrain or adjust the prompt. They don't set it and forget it.
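
A weekly drift check of the kind described above can be very small. The sketch below mirrors the 10% threshold and the metrics named in this section; the forecast values and the example readings are hypothetical placeholders.

    # Minimal weekly drift check against the forecast the CFO originally accepted.
    FORECAST = {"cost_per_outcome": 8.00, "accuracy": 0.92, "escalation_rate": 0.15}
    THRESHOLD = 0.10  # alert when any metric drifts 10% or more from forecast

    def check_drift(weekly: dict[str, float]) -> list[str]:
        alerts = []
        for name, expected in FORECAST.items():
            drift = abs(weekly[name] - expected) / expected
            if drift >= THRESHOLD:
                alerts.append(f"{name}: expected {expected}, got {weekly[name]} ({drift:.0%} drift)")
        return alerts

    # Six months in: cost and escalation rate have crept up, accuracy is roughly flat.
    for alert in check_drift({"cost_per_outcome": 16.00, "accuracy": 0.90, "escalation_rate": 0.28}):
        print(alert)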

What the 5% Do Differently

The organizations that nail AI ROI share a repeatable pattern:

1. Measure baseline before deployment. They calculate cost-per-outcome for the manual process. This number is their control case.

2. Build a cost model that includes the full iceberg. They don't pretend the visible API cost is the total cost. They model retries, human review, infrastructure, vector databases, tool calls—everything.

3. Test at production scale. They run the pilot on a representative sample of real data at real volume for 4-6 weeks. They measure accuracy, cost, and escalation rate at scale. Only then do they forecast ROI.

4. Tie cost to outcome. They integrate with their finance system so that every work item (ticket, claim, loan) carries a cost tag: "This was handled by AI agent X at a cost of $Y, vs. baseline cost of $Z."

5. Measure continuously after deployment. They implement dashboards that track cost per outcome, accuracy, escalation, and latency weekly. They set alert thresholds. If metrics drift, they investigate within days, not months.

Anthropic's Economic Index and McKinsey's research on high performers both confirm the same pattern: the organizations that extract consistent value from AI are the ones that measure obsessively. They don't assume ROI; they prove it.

Why This Matters for the CFO

If you're a CFO who's been burned by AI pilots, this is why. It wasn't the AI technology that failed. It was the measurement infrastructure. You asked the question "did this create value?" and your organization had no system to answer it.

The good news: this is fixable. It doesn't require expensive new software (though Runrate helps). It requires clarity on three numbers: baseline cost per outcome, AI cost per outcome (including all hidden costs), and actual volume at scale. With those three numbers, ROI is arithmetic. And arithmetic is defensible to a board.
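
As an illustration with hypothetical numbers, the entire model fits in a few lines:

    baseline_cost_per_outcome = 20.00  # measured before deployment
    ai_cost_per_outcome = 5.90         # full-iceberg cost, including human review
    monthly_volume = 10_000            # actual volume at scale

    monthly_impact = (baseline_cost_per_outcome - ai_cost_per_outcome) * monthly_volume
    print(f"Monthly P&L impact: ${monthly_impact:,.0f}")  # -> $141,000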

Read How to Actually Measure AI ROI (With Numbers) for the full measurement framework and worked examples.

