AI POC Evaluation: How to Know If a Pilot Is Worth Scaling

According to MIT's GenAI Divide study, 95% of AI pilots fail to deliver material P&L impact. This isn't because the technology doesn't work; it's because success was never defined. A POC that doesn't have a pre-agreed cost-per-outcome baseline is not a POC—it's a confidence game.

Use this framework to evaluate whether your AI pilot is worth scaling. Answer five critical questions. If you answer "yes" to all five, scale. If you answer "no" to exactly one, fix that gap before week 10. If you answer "no" to more than one, renegotiate or stop the pilot.

Question 1: Is cost-per-outcome within your baseline?

By week 6 of your POC, measure actual cost-per-outcome against your pre-agreed baseline.

Example: You and the vendor agreed upfront that cost per resolved support ticket would be $0.55. Week 6 actual: $0.48. That's 13% better than baseline. You passed this gate.

Failure case: Week 6 actual: $0.78. That's 42% worse than baseline. Stop here. Either renegotiate the contract or terminate the POC.

What to measure: The vendor's total cost to you (their invoice, or your metered spend) divided by total work items successfully resolved. Don't measure cost per API call or cost per token; measure cost per business outcome.
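
The arithmetic is simple enough to script as a weekly gate check. Here's a minimal sketch using the figures from the example above (the helper names are illustrative):

```python
def cost_per_outcome(total_vendor_cost: float, resolved_items: int) -> float:
    """Cost per business outcome: total spend divided by work items resolved."""
    if resolved_items == 0:
        raise ValueError("no resolved items; cost-per-outcome is undefined")
    return total_vendor_cost / resolved_items

def pct_vs_baseline(actual: float, baseline: float) -> float:
    """Signed percentage difference vs. baseline (negative = better than baseline)."""
    return (actual - baseline) / baseline * 100

baseline = 0.55  # pre-agreed cost per resolved support ticket
print(pct_vs_baseline(0.48, baseline))  # -12.7 -> ~13% better; gate passed
print(pct_vs_baseline(0.78, baseline))  # +41.8 -> ~42% worse; stop or renegotiate
```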

Common failure mode: The vendor's benchmark is accurate for a narrow use case (e.g., password resets, which are simple), but your actual mix includes more complex tickets (account reconciliation, billing disputes) that cost more to resolve. This is why you need to define "typical work item" before the POC starts.

Question 2: Is attribution data exposed and trustworthy?

By week 6, you should have a cost-per-work-item dashboard running. You should be able to pull up a specific ticket and see:

  • Vendor's cost for that ticket (tokens, API calls, retries, tool calls)
  • Your internal cost (any human review time)
  • Outcome (resolved, escalated, failed)

If the vendor can't show you this data, they don't have attribution built in. This is a red flag. Stop the POC or downgrade the vendor's score.

What to measure: Data freshness (is cost available in real time, or only daily?), completeness (do all tickets have cost data?), and accuracy (does vendor-reported cost match your infrastructure observability logs?).

Common failure mode: The vendor has a cost dashboard, but it's aggregated ("Total cost: $15,432 this week"). You can't join vendor cost with your outcome data. This makes it impossible to measure ROI or optimize later. Insist on work-item-level attribution.
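
The fastest way to verify attribution is to attempt the join yourself. A sketch, assuming the vendor can export per-ticket costs as CSV and you can export outcomes from your helpdesk; all file and column names here are hypothetical:

```python
import csv

# Vendor export: one row per ticket with its cost (tokens, API calls, retries, tool calls)
with open("vendor_costs.csv") as f:
    vendor_cost = {r["ticket_id"]: float(r["cost_usd"]) for r in csv.DictReader(f)}

# Internal export: one row per ticket with outcome and any human review time
with open("ticket_outcomes.csv") as f:
    outcomes = list(csv.DictReader(f))

REVIEW_COST_PER_MINUTE = 0.80  # assumed loaded cost of human review time

# Completeness check: every ticket should have vendor cost data
missing = [t["ticket_id"] for t in outcomes if t["ticket_id"] not in vendor_cost]
if missing:
    print(f"{len(missing)} tickets lack vendor cost data -- attribution is incomplete")

# True cost per resolved ticket = vendor cost + internal review cost
resolved = [t for t in outcomes
            if t["outcome"] == "resolved" and t["ticket_id"] in vendor_cost]
total = sum(vendor_cost[t["ticket_id"]] +
            float(t["review_minutes"]) * REVIEW_COST_PER_MINUTE
            for t in resolved)
print(f"cost per resolved ticket: ${total / len(resolved):.2f}")
```

If the vendor can only hand you an aggregated weekly total, this join is impossible, and so is the optimization work that comes later.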

Question 3: Is integration debt acknowledged and remediated?

Integration debt is the latency, failures, and cost drift introduced by connecting the agent to your systems.

By week 6, measure:

  • Latency: How long does a call from the agent to your CRM take? (Typical: <1 second; acceptable: <2 seconds; unacceptable: >5 seconds)
  • Failure rate: How often do API calls to your systems fail or time out? (Typical: <0.1%; acceptable: <1%; unacceptable: >5%)
  • Cost drift: Does the vendor's cost estimate include tool calls to your CRM, database, and third-party APIs? (Critical: if not, cost is understated)

If integration latency is high (>5 seconds per call), the agent will time out or retry, driving cost up. If the failure rate is high (>5%), the agent will escalate tickets or fail to resolve them, reducing effectiveness. If cost drift is hidden, your actual cost-per-outcome can be 2–3x higher than the vendor's baseline.

What to measure: Trace logs from your integrations: latency percentiles and peak integration latency, failure and timeout rates, and the cost of all API calls made during a resolved ticket.
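
These thresholds are easy to check directly from trace data. A sketch, assuming each agent-to-system call is logged with a duration and a status; the record shape is illustrative:

```python
from statistics import quantiles

# One record per agent-to-system call, pulled from your trace logs (shape illustrative)
calls = [
    {"duration_s": 0.8, "status": "ok"},
    {"duration_s": 1.1, "status": "ok"},
    {"duration_s": 5.2, "status": "timeout"},
    # ... thousands more in a real POC
]

durations = [c["duration_s"] for c in calls]
failed = sum(1 for c in calls if c["status"] in ("timeout", "error"))

p95 = quantiles(durations, n=100)[94]     # 95th-percentile integration latency
failure_rate = failed / len(calls) * 100  # percent of calls that failed or timed out

print(f"p95 latency: {p95:.1f}s (acceptable: <2s)")
print(f"failure rate: {failure_rate:.2f}% (acceptable: <1%)")
```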

Common failure mode: The vendor's benchmark says the agent resolves a ticket in 15 seconds of inference time. In production, each resolution also makes 5 calls to your CRM (5 seconds each = 25 seconds), plus 2 retries caused by timeouts. The benchmark ignores the 25 seconds of integration overhead (plus retry time); your actual end-to-end latency is 50+ seconds.

Question 4: Is a change management plan in place?

The agent is only valuable if your team actually uses it.

By week 6, measure:

  • Team adoption: Are support agents routing tickets to the agent as intended, or are they bypassing it?
  • Quality feedback: Are agents giving the vendor feedback on bad resolutions? Is the vendor tuning the agent based on feedback?
  • Incentive alignment: Are your support team's KPIs still measured on tickets resolved (which incentivizes them to take all tickets), or have they been updated to "first contact resolution rate" or "escalation rate" (which incentivizes using the agent)?

If adoption is low (<50% of eligible tickets), the POC won't measure real-world value.

What to measure: Percentage of eligible tickets routed to the agent. Feedback loop: is the vendor receiving and acting on support team feedback? Team NPS on the agent: would you recommend using this agent?
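
Adoption is a single ratio, but it's worth computing from raw routing data rather than trusting a vendor dashboard. A sketch with illustrative fields:

```python
# Routing log: every ticket during the POC window (fields illustrative)
tickets = [
    {"id": "T-1001", "eligible": True,  "routed_to_agent": True},
    {"id": "T-1002", "eligible": True,  "routed_to_agent": False},  # agent bypassed manually
    {"id": "T-1003", "eligible": False, "routed_to_agent": False},  # out of scope for the agent
]

eligible = [t for t in tickets if t["eligible"]]
adoption = sum(t["routed_to_agent"] for t in eligible) / len(eligible) * 100

print(f"adoption: {adoption:.0f}% of eligible tickets")
if adoption < 50:
    print("below 50% -- the POC will not measure real-world value")
```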

Common failure mode: The agent is live but your support team doesn't trust it. They bypass it and handle tickets manually. Two months in, you have zero scaling data. The POC fails not because the agent doesn't work, but because your team didn't adopt it.

Question 5: Are scaling triggers and volume forecasts realistic?

Before you scale, define exactly what will trigger expansion: "If cost-per-outcome stays below $0.60 for 4 weeks, and escalation rate stays below 5%, we scale to 20,000 tickets/month."
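
Writing the trigger down as code forces precision about what "stays below ... for 4 weeks" actually means. A sketch encoding the trigger above, with illustrative weekly numbers:

```python
# Weekly POC metrics, oldest to newest (numbers illustrative)
weeks = [
    {"cost_per_outcome": 0.55, "escalation_rate": 0.03},
    {"cost_per_outcome": 0.52, "escalation_rate": 0.02},
    {"cost_per_outcome": 0.58, "escalation_rate": 0.04},
    {"cost_per_outcome": 0.54, "escalation_rate": 0.03},
]

COST_CEILING = 0.60        # dollars per resolved ticket
ESCALATION_CEILING = 0.05  # 5%
REQUIRED_WEEKS = 4

window = weeks[-REQUIRED_WEEKS:]
trigger_met = len(window) == REQUIRED_WEEKS and all(
    w["cost_per_outcome"] < COST_CEILING and w["escalation_rate"] < ESCALATION_CEILING
    for w in window
)
print("scale to 20,000 tickets/month" if trigger_met else "hold at current volume")
```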

By week 6, forecast your scale-up curve:

  • What volume will you move to the agent over the next 6 months?
  • How will that volume growth affect cost-per-outcome (does cost increase with complexity)?
  • What's the ramp timeline?

Failure case: Week 6 data looks great (cost = $0.55, escalation = 2%). You decide to 10x volume immediately. Two weeks into scale-up, cost jumps to $1.20 per ticket because the agent is hitting rate limits, retries are spiking, and your team is overwhelmed with escalations. You didn't define a ramp plan.

What to measure: Six weeks of trend data. Is cost-per-outcome stable, improving, or degrading? Is the trend pointing toward scalability, or toward cost increases as volume rises?

Common failure mode: The POC runs at 1,000 tickets/month. Everything looks great. You scale to 10,000/month. Infrastructure breaks, the team becomes a bottleneck, and cost-per-outcome doubles. You should have ramped to 5,000/month first, tested, then ramped to 10,000.

The POC evaluation checklist

At week 6 (mid-POC), use this checklist:

  • [ ] Cost-per-outcome within baseline ±15%
  • [ ] Attribution data is real-time and work-item-level
  • [ ] Integration debt has been measured and is <10% of total latency
  • [ ] Change management plan includes incentive alignment and feedback loops
  • [ ] Scaling trigger is defined and will be measured at week 10
  • [ ] Vendor is responsive to feedback and tuning the agent weekly
  • [ ] Support team NPS on the agent is >6/10
  • [ ] ROI calculation shows payback within 6–12 months of scaling

If six or more boxes are checked: Continue to week 10 and prepare to scale. If four to five boxes are checked: Continue to week 10 but negotiate cost targets or ramp plan. If three or fewer boxes are checked: Renegotiate with the vendor or terminate the POC.
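
The decision rule above is mechanical, which is the point; encoding it keeps the week-6 review from turning into a negotiation. A sketch:

```python
# Week-6 checklist results (True = box checked)
checklist = {
    "cost_within_baseline": True,
    "work_item_attribution": True,
    "integration_debt_measured": False,
    "change_management_plan": True,
    "scaling_trigger_defined": True,
    "vendor_tuning_weekly": True,
    "team_nps_above_6": False,
    "payback_within_12_months": True,
}

checked = sum(checklist.values())
if checked >= 6:
    decision = "continue to week 10 and prepare to scale"
elif checked >= 4:
    decision = "continue to week 10, but renegotiate cost targets or ramp plan"
else:
    decision = "renegotiate with the vendor or terminate the POC"

print(f"{checked}/8 boxes checked: {decision}")
```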

The week 10 decision

At week 10, measure again. This time, add:

  • 8-week trend: Is cost-per-outcome stable? Is escalation rate stable?
  • Outcome quality: Of the tickets resolved by the agent, what % require human follow-up due to errors?
  • Customer satisfaction: Are customers satisfied with agent resolutions? (If the agent resolves 95% of tickets but customers are unhappy, scaling won't work.)

If all metrics are stable or improving, scale. If metrics are degrading, renegotiate or stop.
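
Of the three, outcome quality is the easiest to misread: a ticket the agent "resolved" that a human later reopens shouldn't count as a clean resolution. A sketch of the follow-up check, with illustrative fields:

```python
# Tickets the agent marked resolved during the POC (fields illustrative)
agent_resolved = [
    {"id": "T-2001", "human_followup": False},
    {"id": "T-2002", "human_followup": True},   # customer reopened; resolution was wrong
    {"id": "T-2003", "human_followup": False},
]

followup_rate = (sum(t["human_followup"] for t in agent_resolved)
                 / len(agent_resolved) * 100)
print(f"{followup_rate:.0f}% of agent-resolved tickets needed human follow-up")
```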

The key insight: A POC that doesn't have cost-per-outcome and attribution gates is not a POC—it's a pilot. Pilots are useful for learning, but they don't tell you whether something will work at scale. A POC is supposed to tell you "yes, we should scale" or "no, we should stop." If you don't know the answer at week 10, the POC was poorly designed.

For the full vendor evaluation process, see "How to Buy AI: The Executive's Vendor Evaluation Guide."
