KPIs for AI Agents: The 12 Metrics That Actually Matter

9 min read · Updated 2026-05-02

Runrate Framework

AI Workforce P&L

Treat AI agents like employees: cost structure, productivity target, and retirement trigger per agent.

Read the full framework →

Most AI teams measure the wrong metrics. They track accuracy, latency, and token count because those are easy to log. But a CFO doesn't care whether an AI agent resolved a ticket in 400ms or 600ms. They care whether it reduced the cost per outcome and improved margin. This article identifies the 12 KPIs that actually drive business value: the metrics the highest-performing 5% of AI organizations measure religiously. Skip the vanity metrics. Track these twelve instead.

Why Vanity Metrics Fail

Accuracy is not a KPI. Latency is not a KPI. Token efficiency is not a KPI—unless it drives down cost per outcome.

Here's the distinction: a metric measures something. A KPI measures something that drives business value.

An AI team celebrating 94% accuracy without measuring escalation rate or human review cost is celebrating a vanity metric. The agent might be accurate in a lab, but at scale, escalation overhead might make it less economical than a human. A team celebrating 400ms latency while its cost per outcome sits at $18 (vs. a $20 baseline) is optimizing for speed while ignoring margin.

The 12 KPIs below are the ones that matter because they directly inform the CFO's core question: Is this agent creating economic value, and if so, how much?

The 12 KPIs

1. Cost Per Resolved Work Item

This is the north star. Everything else is a diagnostic.

Definition: (Total AI spend + human review cost + infrastructure cost) / (number of outcomes completed)

Why it matters: This is the metric the CFO uses to measure ROI. It answers the question "what does this agent cost to deliver an outcome?" If your baseline cost per ticket is $20 and your AI cost is $6, you have a 70% cost reduction. If it's $24, you're losing money.

How to measure: Divide your total monthly cost by the number of completed work items. Include all cost layers: API, retries, human review, infrastructure, observability, vector database. If you're missing a layer, your number is wrong.

Target: For high-performing organizations, cost per outcome with AI is 30-60% of baseline. If you're at 80% of baseline (only a 20% reduction), you need to ask why. Is escalation higher than forecast? Is human review eating the savings? Is API cost higher than expected?
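
To make the cost layers concrete, here is a minimal Python sketch. The cost categories and dollar figures are illustrative assumptions, not actuals from any particular deployment.

```python
# Illustrative sketch: cost per resolved work item with every cost layer included.
# Category names and dollar figures are assumptions for the example only.

def cost_per_outcome(monthly_costs: dict[str, float], outcomes_completed: int) -> float:
    """Total monthly spend across all cost layers, divided by completed work items."""
    return sum(monthly_costs.values()) / outcomes_completed

monthly_costs = {
    "api_calls": 4_200.0,       # model inference, including retries
    "human_review": 2_600.0,    # reviewer time on escalated items
    "infrastructure": 900.0,    # hosting, queues, orchestration
    "observability": 300.0,     # logging and tracing
    "vector_database": 250.0,   # retrieval layer
}

print(round(cost_per_outcome(monthly_costs, outcomes_completed=1_400), 2))  # ~5.89
```

Against a $20 baseline cost per ticket, the roughly $6 result here is the 70% reduction scenario described above. Leave a cost layer out of the dictionary and the number will flatter you for the wrong reason.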

2. Resolution Rate (First-Touch)

What percent of work items are resolved by the AI agent without human escalation?

Definition: (Work items resolved by AI without escalation) / (total work items processed) × 100

Why it matters: First-touch resolution (FTR) determines whether your agent saves money. An agent that resolves 50% of tickets without escalation has very different economics than an agent that resolves 85%. FTR directly impacts the cost per outcome calculation.

How to measure: Log every work item and mark whether it was resolved by AI or escalated. Track weekly.

Target: Most successful pilots land at 65-80% first-touch resolution in production. If you're above 85%, you might be missing edge cases that your human team would catch. If you're below 60%, your escalation overhead is eating your savings.
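
If you log every work item, the calculation is a one-liner. A minimal sketch, assuming a simple log schema with an escalated flag (the field names are assumptions, not any specific product's format):

```python
# Illustrative sketch: first-touch resolution from a work-item log.
work_items = [
    {"id": "T-1001", "escalated": False},
    {"id": "T-1002", "escalated": True},
    {"id": "T-1003", "escalated": False},
    {"id": "T-1004", "escalated": False},
]

resolved_by_ai = sum(1 for item in work_items if not item["escalated"])
ftr = resolved_by_ai / len(work_items) * 100
print(f"First-touch resolution: {ftr:.1f}%")  # 75.0%
```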

3. Escalation Rate to Human

The complement of first-touch resolution: what percent of work items require human intervention?

Definition: (Work items escalated to human) / (total work items processed) × 100

Why it matters: Escalation is a direct cost driver. Every escalated item requires human time, which reverses some of your AI savings. High escalation rates (>40%) often signal that the agent isn't ready for production or needs retraining.

How to measure: Log escalations weekly. Break them down by reason: accuracy issue, edge case, policy exception, user request for human review.

Target: 20-35% escalation is realistic for first-generation agents in complex domains. If you hit >50%, the agent is adding cost. If you're <10%, be skeptical—you might be missing escalations.
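
A sketch of the weekly breakdown, assuming each escalation record carries one of the reason labels above; the records and totals are placeholders.

```python
# Illustrative sketch: escalation rate plus the breakdown by reason.
from collections import Counter

escalations = [
    {"id": "T-1002", "reason": "edge_case"},
    {"id": "T-1017", "reason": "accuracy_issue"},
    {"id": "T-1023", "reason": "user_requested_human"},
    {"id": "T-1031", "reason": "edge_case"},
]
total_work_items = 20  # everything processed in the same period

escalation_rate = len(escalations) / total_work_items * 100
print(f"Escalation rate: {escalation_rate:.1f}%")   # 20.0%
print(Counter(e["reason"] for e in escalations))    # shows which reason dominates
```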

4. Human Review Time Per Escalation

Of the escalated items, how long does it take a human to review or complete them?

Definition: Total human review time on escalated items / number of escalated items

Why it matters: This is where hidden cost lives. If escalation seems cheap at 15% but each escalation requires 15 minutes of human time, the true cost is much higher. Human review time directly impacts cost per outcome.

How to measure: Track human time spent on AI-escalated items. Some systems log this automatically; others require time tracking.

Target: 2-5 minutes of human review per escalation for well-designed agents. If you're at 10+ minutes, the agent is creating extra work, not saving it.
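
A minimal sketch of the calculation and the reviewer cost it implies; the minutes and hourly rate below are placeholder assumptions.

```python
# Illustrative sketch: average review time per escalation and the reviewer cost it implies.
review_minutes = [4, 6, 3, 12, 5]   # human minutes spent on each escalated item
hourly_reviewer_cost = 40.0         # assumed fully-loaded cost per reviewer hour

avg_minutes = sum(review_minutes) / len(review_minutes)
cost_per_escalation = avg_minutes / 60 * hourly_reviewer_cost
print(f"{avg_minutes:.1f} min per escalation, ~${cost_per_escalation:.2f} of reviewer time each")
```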

5. AI Margin Contribution

What margin (revenue minus cost) does the agent generate?

Definition: (Revenue attributable to agent's completed work) - (AI cost + human review cost + infrastructure cost)

Why it matters: This is the frame the CFO uses for the board. It's not enough to say "we reduced cost." You need to quantify the economic value. "Our AI customer support agent generated $180,000 in margin contribution in Q4" is a board-grade statement.

How to measure: Tie each completed work item to its revenue impact or cost avoidance. Then subtract all costs (AI + human + infrastructure).

Target: Margin contribution should turn positive by month 3-4 of the first year if the deployment is working. If you're still losing money in month 6, investigate why.
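
A sketch of the arithmetic behind a statement like the $180,000 example above. The attribution figures are placeholders; real revenue or cost-avoidance attribution depends on your own model.

```python
# Illustrative sketch: quarterly AI margin contribution with placeholder figures.
value_attributed = 240_000.0     # revenue or cost avoidance tied to agent-completed work
ai_spend = 38_000.0              # API calls, including retries
human_review_cost = 15_000.0
infrastructure_cost = 7_000.0

margin_contribution = value_attributed - (ai_spend + human_review_cost + infrastructure_cost)
print(f"Margin contribution: ${margin_contribution:,.0f}")  # $180,000
```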

6. Token Efficiency (Tokens Per Outcome)

How many tokens does the agent consume to deliver one outcome?

Definition: Total tokens (input + output) used per work item

Why it matters: Token efficiency drives inference cost. An agent that uses 2,000 tokens per ticket is cheaper than one that uses 8,000 tokens per ticket, all else equal. Tracking this metric helps you optimize prompts and reduce API cost.

How to measure: Log tokens per API call and aggregate by work item. Track weekly trends.

Target: Aim for <3,000 tokens per outcome for most use cases. If you're >5,000 tokens, your prompts might be too verbose or you're making too many API calls per outcome.
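
A sketch of the aggregation, assuming each API call is logged with the work item it served; the schema and token counts are assumptions.

```python
# Illustrative sketch: tokens per outcome, summing every call made for each work item.
from collections import defaultdict

api_calls = [  # assumed log schema: one row per API call
    {"work_item": "T-1001", "input_tokens": 1_200, "output_tokens": 300},
    {"work_item": "T-1001", "input_tokens": 800, "output_tokens": 250},
    {"work_item": "T-1002", "input_tokens": 1_500, "output_tokens": 400},
]

tokens_by_item = defaultdict(int)
for call in api_calls:
    tokens_by_item[call["work_item"]] += call["input_tokens"] + call["output_tokens"]

avg_tokens = sum(tokens_by_item.values()) / len(tokens_by_item)
print(f"Average tokens per outcome: {avg_tokens:,.0f}")  # 2,225
```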

7. Cache Hit Rate

What percent of API calls benefit from prompt caching?

Definition: (API calls using cached context) / (total API calls) × 100

Why it matters: Prompt caching (offered by providers such as Anthropic and OpenAI) can cut the cost of cached input tokens by up to 90% on cache hits, depending on the provider. If you're not using it, you're paying several times more than you need to for repeated context. If you are using it, cache hit rate directly reduces cost per outcome.

How to measure: Most API providers log cache hit rate. Aggregate by agent and track weekly.

Target: 40%+ cache hit rate is achievable for agents that interact with the same knowledge base repeatedly (customer support, claims, loan origination). If you're at <20%, you might not have enough contextual similarity across calls.
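
A sketch of the rate plus a rough estimate of input-token savings. The base price and the 90% cached-token discount are assumptions for the example; actual discounts vary by provider.

```python
# Illustrative sketch: cache hit rate and a rough estimate of what caching saves.
total_calls = 10_000
cached_calls = 4_500
cached_input_tokens = 9_000_000
base_price_per_million_input = 3.00   # assumed USD per million input tokens
cached_discount = 0.90                # assumed fraction saved on cached input tokens

hit_rate = cached_calls / total_calls * 100
savings = cached_input_tokens / 1_000_000 * base_price_per_million_input * cached_discount
print(f"Cache hit rate: {hit_rate:.0f}%, estimated input-token savings: ${savings:,.2f}")
```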

8. Retry Rate

What percent of API calls fail and require retry?

Definition: (Failed API calls that were retried) / (total API calls) × 100

Why it matters: Every retry is an extra API call, which doubles or triples the cost of that work item. Retries often signal a problem: unstable API, network issues, or a prompt that's too brittle. Tracking retry rate forces you to optimize.

How to measure: Log failures and retries. Categorize by failure type: timeout, rate limit, model error, validation failure.

Target: <5% retry rate is the goal. If you're consistently at 10-15%, investigate the root cause. High retry rates are a sign of infrastructure problems or poorly designed error handling.
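
A sketch of the rate with the failure-type breakdown described above, assuming a simple call log; statuses and failure labels are placeholders.

```python
# Illustrative sketch: retry rate and the failure-type breakdown from a call log.
from collections import Counter

call_log = [
    {"status": "ok"}, {"status": "ok"}, {"status": "ok"},
    {"status": "retried", "failure": "timeout"},
    {"status": "ok"}, {"status": "ok"}, {"status": "ok"},
    {"status": "retried", "failure": "rate_limit"},
    {"status": "ok"}, {"status": "ok"},
]

retries = [c for c in call_log if c["status"] == "retried"]
retry_rate = len(retries) / len(call_log) * 100
print(f"Retry rate: {retry_rate:.1f}%")          # 20.0% here, far above the <5% target
print(Counter(r["failure"] for r in retries))    # timeout vs. rate limit vs. model error
```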

9. Latency to Outcome (P95)

How long does it take the agent to deliver an outcome, 95th percentile?

Definition: Time from work item reception to outcome delivery, 95th percentile

Why it matters: Latency affects user experience and operational SLAs. If your agent resolves a ticket in 30 seconds but takes 5 minutes for complex queries, users will escalate to humans. Tracking P95 (not average) shows you the tail behavior.

How to measure: Log timestamps for every work item from receipt to outcome. Calculate P95 weekly.

Target: For customer support, aim for <120 seconds P95 for routine resolutions. For claims or loans, <5 minutes might be acceptable depending on your SLA. If you're consistently hitting your SLA, latency can usually be deprioritized relative to accuracy.
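
A sketch of the percentile calculation using Python's statistics module; the durations are made-up values in seconds, chosen to show how the tail diverges from the average.

```python
# Illustrative sketch: P95 latency to outcome vs. the average, from per-item durations in seconds.
import statistics

durations_s = [22, 31, 28, 45, 19, 240, 33, 27, 52, 38, 29, 310, 41, 26, 35, 30, 24, 48, 36, 27]

p95 = statistics.quantiles(durations_s, n=100)[94]  # index 94 is the 95th-percentile cut point
print(f"P95: {p95:.0f}s, mean: {statistics.mean(durations_s):.0f}s")  # the tail is what users feel
```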

10. Cost Variance vs. Forecast

How much did actual cost per outcome deviate from forecast?

Definition: (Actual cost per outcome - Forecasted cost per outcome) / Forecasted cost per outcome × 100

Why it matters: This is your reality-check metric. If you forecasted $8 per ticket and you're hitting $12, something changed. Maybe volume was lower (fixed costs spread thin). Maybe escalation was higher. Maybe API costs spiked. Whatever the cause, tracking variance forces you to investigate and adjust.

How to measure: Compare actual cost per outcome monthly against your original forecast. Break down variance by source: volume, escalation rate, API cost, human review rate.

Target: Variance <10% is good. Variance >20% means your forecast needs rework or your agent needs optimization.
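
A sketch of the variance calculation, using the $8 forecast and $12 actual from the example above.

```python
# Illustrative sketch: cost variance vs. forecast, as a signed percentage.
forecast_cost_per_outcome = 8.00
actual_cost_per_outcome = 12.00

variance_pct = (actual_cost_per_outcome - forecast_cost_per_outcome) / forecast_cost_per_outcome * 100
print(f"Variance vs. forecast: {variance_pct:+.0f}%")  # +50%, well past the 20% rework threshold
```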

11. Customer Satisfaction Delta (with AI vs. without)

Does customer satisfaction change when they interact with the AI agent vs. a human?

Definition: (CSAT score for AI-resolved tickets) - (CSAT score for human-resolved tickets)

Why it matters: An agent that reduces cost but hurts satisfaction is a net loss. Conversely, an agent that maintains or improves satisfaction while reducing cost is a clear win. Tracking CSAT delta ensures you're not sacrificing quality for cost.

How to measure: Survey customers after resolution. Track CSAT separately for AI and human resolutions.

Target: AI CSAT should be within 5 points of human CSAT. If AI is >10 points lower, customers notice the difference and you risk brand damage. If AI is equal or higher, you can market that as a win.
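
A sketch of the delta, assuming post-resolution surveys are tagged with who resolved the ticket; the schema and scores are placeholders.

```python
# Illustrative sketch: CSAT delta between AI-resolved and human-resolved tickets.
surveys = [  # assumed schema: one row per post-resolution survey response
    {"resolved_by": "ai", "csat": 88}, {"resolved_by": "ai", "csat": 91},
    {"resolved_by": "human", "csat": 93}, {"resolved_by": "human", "csat": 90},
]

def avg_csat(group: str) -> float:
    scores = [s["csat"] for s in surveys if s["resolved_by"] == group]
    return sum(scores) / len(scores)

delta = avg_csat("ai") - avg_csat("human")
print(f"CSAT delta (AI minus human): {delta:+.1f} points")  # -2.0, within the 5-point target
```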

12. Headcount Equivalent Saved

How many full-time employees' work did the agent replace?

Definition: (Total annual savings attributable to the agent) / (annual fully-loaded cost per FTE) = FTE equivalent

Why it matters: This is the language executives understand. "Our AI agent saved 2.3 FTEs" is more visceral than "reduced cost by $118k." It also helps with retention (you can redeploy human staff rather than cutting headcount).

How to measure: Calculate: (total annual savings from agent) / (annual fully-loaded cost per FTE). For customer support at $65k fully-loaded per CSR, a $150k annual savings = 2.3 FTE equivalents.

Target: Most successful agents save 1-3 FTEs per deployment. If you save less, the ROI is weak. If you save more, you might be over-automating and creating quality issues.
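
The arithmetic from the example above as a short sketch; both figures are the illustrative numbers already quoted.

```python
# Illustrative sketch: headcount equivalent saved, using the figures from the example above.
annual_savings = 150_000.0         # total annual savings attributed to the agent
fte_fully_loaded_cost = 65_000.0   # fully-loaded annual cost per CSR

print(f"FTE equivalent saved: {annual_savings / fte_fully_loaded_cost:.1f}")  # 2.3
```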

How to Track These 12 KPIs

The best teams implement a single dashboard that shows all 12 metrics updated weekly. The dashboard should:

  1. Show current values and trends (4-week moving average).
  2. Highlight when any metric drifts >10% from forecast or target.
  3. Break down metrics by agent (if you have multiple agents).
  4. Show month-over-month and year-over-year comparisons.

Most finance and operations teams can build this dashboard in a data warehouse (BigQuery, Snowflake, Databricks) or a BI tool (Tableau, Looker, Mode Analytics) in 2-3 days, pulling from logs of work-item completion, cost attribution, and human review time.
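
As a sketch of the drift highlighting in point 2 above, the check below flags any KPI that moves more than 10% from its target. The metric names, targets, and actuals are placeholders, not a prescribed schema.

```python
# Illustrative sketch: flag any KPI that drifts more than 10% from its target.
targets = {"cost_per_outcome": 6.00, "escalation_rate": 30.0, "cache_hit_rate": 40.0}
actuals = {"cost_per_outcome": 7.10, "escalation_rate": 31.0, "cache_hit_rate": 22.0}

for metric, target in targets.items():
    drift = abs(actuals[metric] - target) / target
    if drift > 0.10:
        print(f"ALERT {metric}: {actuals[metric]} vs. target {target} ({drift:.0%} drift)")
```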

Runrate's dashboard pre-builds this for you at the work-item level, so cost per outcome and all 12 KPIs update automatically as your agents run. But if you're building your own, these are the 12 metrics to prioritize.

The key is to measure all 12 together, not in isolation. Cost per outcome is the primary KPI; the other 11 are diagnostics that help you understand why it's moving. When cost per outcome drifts, you look at escalation rate, human review time, and cache hit rate to understand the cause.

Track these religiously, and you'll be in the 5% of organizations that extract consistent value from AI agents.

Calculate your AI ROI.

See what your agents actually cost — and what they're returning.

Open the Calculator
