Most enterprises buy AI the way they bought cloud in 2008: run a three-month POC with two vendors, accept their cherry-picked benchmarks at face value, then lock into a three-year contract with no baseline for cost-per-outcome and no plan to measure whether the agent actually delivers. According to MIT's GenAI Divide study, 95% of AI pilots fail to deliver material P&L impact—not because the technology doesn't work, but because buyers never defined what "work" costs in the first place. The solution is a structured vendor evaluation framework that forces clarity on cost transparency, attribution capability, integration debt, and measurement rigor before you sign.
What's broken about how most enterprises buy AI today
The typical AI vendor purchase follows a familiar rhythm. Your team runs a POC with Vendor A and Vendor B. Both promise the world. Vendor A's benchmark shows they can resolve support tickets at 45 cents per ticket; Vendor B claims 52 cents. You pick Vendor A, scale the deployment, then discover six months later that your actual cost-per-ticket is $1.80. The discrepancy isn't dishonesty—it's invisibility. Vendor A's benchmark assumes perfect conditions: no retries, no failed API calls to your CRM, no human-in-the-loop review, no prompt caching, no observability overhead. Your production environment includes all of it.
The AI Cost Iceberg is the frame here. What vendors show you—their headline token cost and API pricing—is about 10% of your true cost. The other 90% is infrastructure: inference retries when the first call fails, gateway overhead, vector database storage for context retrieval, observability and monitoring, human review loops, and the cost of tools the agent calls to third-party systems (Stripe, Twilio, Salesforce). Most CFOs budget for the visible API bill and get surprised by the hidden cost structure.
A second failure mode is the missing attribution layer. You deploy an AI agent to your customer service team. Six months in, you want to answer: "What's the ROI on this?" Your observability tells you call volume and average resolution time. It doesn't tell you which calls were AI-assisted versus human, which calls required escalation, or whether the AI agent's speed gains came at the cost of resolution quality or customer satisfaction. You have no cost-per-outcome baseline to compare against. Intercom's Fin reports $0.99 per resolution, Klarna reports $0.19 per resolved ticket, Sierra reports around $1.50—but you have no way to know if you're in that ballpark or if your implementation is a 3x outlier.
A third failure mode is vendor cherry-picking. If you're evaluating three vendors and one vendor's benchmark is more aggressive than the others, it's often because they're defining "resolved" or "successful outcome" differently. Vendor A counts a ticket as resolved when the agent sends a response. Vendor B counts it as resolved only when the customer confirms satisfaction. Vendor C counts it as resolved only if no escalation occurs within 48 hours. Without a strict, shared definition of what outcome you're measuring, you're comparing apples to pineapples.
Finally, most enterprises skip the integration-debt conversation. An AI agent doesn't live in isolation. It needs to read from your CRM, write to your ticketing system, call your payment processor, and log outcomes to your data warehouse. Every integration is a vector for latency, failure, and cost drift. Vendors who don't surface integration cost as a line item are hiding it.
The buyer's journey: From needs assessment to measurement
A structured buying process has seven stages. Each stage has a specific question to answer before moving to the next.
Stage 1: Needs assessment—Define the work item and the baseline. You can't measure ROI without a baseline. Before you talk to any vendor, know your current state: How many customer service tickets do you handle per month? What's the current cost per ticket (including labor)? What percentage are routine versus complex? How many are escalated? This is your North Star. Every vendor's benchmark should be compared against your baseline, not accepted on its own terms.
The work item matters more than the function. "AI for customer service" is too broad. "AI for password-reset and billing-inquiry tickets in English-speaking markets" is specific enough to benchmark. When you can break the function into atomic work items, you can attribute cost and outcome separately. Klarna doesn't claim "AI customer service costs $0.19 per ticket"; it reports that its AI resolved password resets and account-access tickets at that rate, for English-speaking customers, with a 72-hour SLA. That specificity is what makes the number credible.
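To make the atomic work item concrete, here's a minimal sketch of how one might be captured as structured data. Every field name and figure below is an illustrative assumption, not Klarna's or any vendor's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkItem:
    """One atomic, benchmarkable unit of work (illustrative fields)."""
    function: str         # e.g. "customer_service"
    task_class: str       # e.g. "password_reset"
    market: str           # e.g. "en-US"
    sla_hours: int        # e.g. 72
    baseline_cost: float  # current fully loaded cost per item, in dollars

# The baseline you measure in Stage 1, before talking to any vendor
# (the dollar figure is an assumed human-handled cost, not real data):
password_reset = WorkItem(
    function="customer_service",
    task_class="password_reset",
    market="en-US",
    sla_hours=72,
    baseline_cost=4.50,
)
```

A definition this narrow is what later lets you join cost events and outcomes at the same granularity.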
Stage 2: Vendor longlist—Look for cost transparency first. Create a list of 5–8 vendors in your category. The first screen is not capability—it's willingness to be transparent about cost. Call each vendor's sales team and ask: "What's your cost per resolved work item, and what assumptions does that include?" A vendor that gives you a number is worth talking to. A vendor that says "cost depends on your use case" without any framework is hiding something.
Also ask: "Can you expose cost attribution in your API responses?" Cost attribution is the ability to tag each API call with the work item it served (the support ticket ID, the claim ID, the loan application ID) so you can later calculate true cost-per-outcome. Vendors with first-class attribution APIs are rare. Most vendors have observability dashboards but no hook into your analytics. That's a problem.
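Here's what first-class attribution looks like from the buyer's side, as a minimal sketch. The endpoint, metadata field, and usage fields are hypothetical stand-ins, since every vendor's schema differs; the point is that the work-item ID travels out with the call and the cost comes back with the response:

```python
import requests

VENDOR_URL = "https://api.example-vendor.com/v1/respond"  # hypothetical endpoint

def call_with_attribution(prompt: str, work_item_id: str) -> dict:
    """Tag a vendor call with the work item it serves and keep the
    vendor-reported cost next to that ID in your own logs."""
    resp = requests.post(
        VENDOR_URL,
        json={
            "prompt": prompt,
            # Assumed metadata field; check whether your vendor offers an
            # equivalent. If nothing like it exists, that's a rubric red flag.
            "metadata": {"work_item_id": work_item_id},
        },
        timeout=30,
    )
    resp.raise_for_status()
    usage = resp.json().get("usage", {})  # assumed response shape
    return {
        "work_item_id": work_item_id,
        "tokens": usage.get("total_tokens"),
        "vendor_cost": usage.get("cost"),
    }
```

If the vendor's API gives you nowhere to put that ID and nothing to read back, you'll be reverse-engineering cost from a monthly invoice instead.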
Stage 3: RFP—Ask 50 specific questions about cost, control, and outcomes. Don't accept a vendor's template RFP. Write your own. The questions should force precision on five topics: cost modeling, attribution capability, integration and lock-in, performance and SLA, and compliance and data. See the companion article "The AI RFP Template: 50 Questions Every Buyer Should Ask" for the full list, but in brief:
Cost questions should ask: What's the 95th percentile token usage per resolved work item? How does pricing scale above tier 1? Is there a true-up clause if you exceed volume? What's the commitment term, and what are the cancellation terms?
Attribution questions should ask: What metadata is exposed in your API responses? Can you tag results by work-item ID? Do you offer batch cost reporting by work-item class?
Lock-in questions should ask: How is your model versioning defined? If you deprecate Claude 3.5 Sonnet for a newer model, do I have to re-evaluate my agent? What data can I export, and in what format?
Integration questions should ask: How many API calls to my CRM/database/ticketing system are allowed per resolved work item? Are those calls included in your token count or charged separately?
Stage 4: Vendor demo—Watch for the attribution question moment. A vendor demo is typically a scripted 30-minute presentation. The useful part is the last 15 minutes, when you ask the hard questions. See "What Questions to Ask in an AI Vendor Demo" for the full list, but the key moment is this: Ask the vendor to show you their cost attribution interface. Pull up a specific resolved work item and ask them to show you the cost breakdown: tokens, API calls, retries, and tool calls. A vendor that can do this in real time has attribution built in. A vendor that says "I'll have to get back to you on that" doesn't.
Stage 5: POC—Set cost-per-outcome as a mandatory gate. Run a POC with two vendors maximum. The POC should be six to eight weeks, not three months. Most of the learning happens in the first four weeks; the back half of a quarter-long pilot is mostly confirmation bias.
Define a success metric upfront. "Successful" is not "the agent resolved some tickets." It's "the agent resolved tickets at $0.45 per ticket or less, with no escalation rate above 8%, and a first-contact resolution rate above 90%." These are all measurable. Before you deploy, agree with the vendor on the benchmark. After 6 weeks, measure. If the numbers align, move forward. If they don't, the vendor should explain why and offer to fix the agent or renegotiate the contract.
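Those thresholds translate directly into a pass/fail gate you can run at week 6. A minimal sketch using the numbers from the paragraph above (the measured inputs are made up):

```python
def poc_gate(cost_per_ticket: float, escalation_rate: float,
             first_contact_resolution: float) -> bool:
    """True only if all three pre-agreed targets are met."""
    return (
        cost_per_ticket <= 0.45                # dollars per resolved ticket
        and escalation_rate <= 0.08            # no more than 8% escalated
        and first_contact_resolution >= 0.90   # at least 90% FCR
    )

# Week-6 measurement (illustrative):
print(poc_gate(0.52, 0.06, 0.93))  # False: the cost target was missed
```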
This is where MIT's 95% failure statistic bites. Most POCs fail not because the technology doesn't work, but because success was never defined. A POC without a pre-agreed cost-per-outcome target is not a proof of concept; it's an open-ended trial.
Stage 6: Contract negotiation—Nail down cost models, true-up terms, and exit clauses. The contract should specify: (1) your baseline cost-per-outcome and the tolerance band (e.g., $0.45 ±$0.10); (2) the true-up and true-down clauses (if you exceed the agreed volume, you pay for overages; if you use less, you get a credit); (3) the MFN clause (most-favored-nation: if the vendor gives a lower price to a peer, you get it too); (4) what happens if the vendor deprecates a model or changes its pricing; (5) your data ownership and export rights; (6) your termination rights and notice periods; and (7) the vendor's audit rights and your audit rights on their invoice.
See "How to Negotiate AI Vendor Contracts in 2026" for the full playbook, but the biggest lever is cost transparency. If you commit to 12 months at a 40% discount in exchange for true cost visibility—meaning the vendor opens their cost ledger to your auditor—you've flipped the incentive. The vendor is no longer hiding cost; they're being transparent about it because you're paying for transparency.
Stage 7: Deployment and measurement—Set up the attribution layer from day one. The moment the agent goes live, you should have a cost-per-outcome dashboard running. This requires engineering work. You need to tag every API call from the agent with the work-item ID it served. You need to ingest the vendor's cost data and join it with your outcome data (resolution status, escalation flag, customer satisfaction score). You need to update the dashboard weekly. This is not a one-time project; it's operational infrastructure.
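A minimal sketch of that join, assuming you've been logging attribution records like the Stage 2 example; field names and figures are illustrative:

```python
def cost_per_resolution(cost_events: list[dict], outcomes: list[dict]) -> float | None:
    """Total vendor spend divided by resolved outcomes. Spend on failed
    or escalated items still counts: it was spent chasing resolutions."""
    total_spend = sum(e["vendor_cost"] for e in cost_events)
    resolved = sum(1 for o in outcomes if o["resolved"])
    return total_spend / resolved if resolved else None

# One tiny slice of a week (illustrative):
costs = [
    {"work_item_id": "T-1", "vendor_cost": 0.30},
    {"work_item_id": "T-1", "vendor_cost": 0.12},  # a retry, still paid for
    {"work_item_id": "T-2", "vendor_cost": 0.28},  # later escalated
]
results = [
    {"work_item_id": "T-1", "resolved": True},
    {"work_item_id": "T-2", "resolved": False},
]
print(cost_per_resolution(costs, results))  # 0.7 per resolved ticket
```

Note the design choice: dividing all spend by resolved outcomes, rather than counting only the spend on resolved items, keeps failed attempts inside the metric instead of hiding them.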
The AI Cost Iceberg and hidden vendor costs
Most vendor pricing shows you the API cost and the per-request fee. That's the tip of the iceberg. Your total cost includes:
Inference retries. When an API call fails (rate limit, timeout, invalid response), you retry. With a first-call success rate of p, expect roughly 1/p attempts per outcome, and 1/p times the quoted inference cost. A vendor who doesn't explicitly bound retry cost is hiding a budget line. Ask: "What's your SLA on first-call success rate, and what cost do I bear for retries?"
Vector database storage and retrieval. If the agent needs context, you're storing embeddings in a vector database (Pinecone, Weaviate, Qdrant) and paying per query. This cost scales with context window length and retrieval volume. Ask: "Does your pricing include vector retrieval, or am I paying separately?"
Tool calls and third-party APIs. Every time the agent calls your CRM or your payment processor, you're paying CRM/processor pricing, not just vendor pricing. A smart vendor will help you estimate this. A vendor who ignores it is understating total cost.
Human-in-the-loop review. Some outcomes need human review before they're finalized. A claims agent that auto-approves claims under $10,000 but escalates everything above that still relies on humans for the hard cases. That labor cost is part of the true cost-per-outcome, and it's often 3–5x the agent cost itself.
Observability and monitoring. You need logging, tracing, and alerting on agent behavior. Some vendors include this; most vendors expect you to build it with Langfuse, LangSmith, or similar tools. Budget $5K–$20K per month for observability infrastructure if you're running at scale.
Prompt caching and context management. If the agent processes the same customer record or system prompt 100 times a day, caching can cut token cost in half. But caching infrastructure adds complexity and cold-start latency. Ask vendors: "Do you support prompt caching, and what's the latency trade-off?"
This is why the AI Cost Iceberg framework matters. A vendor's headline price of "50 cents per ticket" can easily become $1.80 once you include retries, escalation, human review, observability, and integration cost. The vendors who win on cost are not the ones with the lowest API rates; they're the ones who minimize the hidden costs through architectural rigor.
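Here's that arithmetic worked through once. Every input is an illustrative assumption, not measured data; the point is how quickly the hidden layers compound:

```python
# How a $0.50 headline price becomes roughly $1.80 in production.
headline = 0.50             # vendor's quoted cost per resolved ticket

first_call_success = 0.80   # assumed; expected attempts = 1 / success rate
inference = headline / first_call_success  # 0.625 including retries

tool_calls = 0.15           # CRM / payments / ticketing API fees per ticket
vector_retrieval = 0.10     # embedding storage and queries per ticket
observability = 0.08        # logging and tracing, amortized per ticket

review_rate = 0.10          # share of tickets needing human review
review_labor = 8.00         # assumed loaded labor cost per reviewed ticket
human_review = review_rate * review_labor  # 0.80 amortized per ticket

true_cost = inference + tool_calls + vector_retrieval + observability + human_review
print(f"${true_cost:.1f} per ticket")  # $1.8: about 3.5x the headline price
```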
The vendor evaluation rubric: Cost, control, and outcomes
Use this rubric to score vendors objectively. Each dimension is scored 1–5 (1 = unacceptable, 5 = best in class). Multiply each dimension by its weight, sum, and compare; a worked scoring example follows the rubric.
Cost Transparency (Weight: 25%)
- 5 = Vendor publishes cost-per-outcome benchmarks, exposes cost metadata in API, allows third-party cost audits
- 4 = Vendor publishes cost per request and per token, includes most hidden cost vectors in estimate
- 3 = Vendor publishes per-request pricing and API cost, some hidden costs acknowledged
- 2 = Vendor publishes only token or request pricing, ignores integration and observability cost
- 1 = Vendor refuses to quote cost or gives only vague benchmarks
Attribution Capability (Weight: 25%)
- 5 = Cost data tagged by work-item ID in real time; API exposes cost breakdown by token, retry, tool call
- 4 = Cost data tagged by work-item class (category of ticket); batch reporting available
- 3 = Cost data available in dashboard; can be exported monthly; work-item tagging partial
- 2 = Cost data available but aggregated; no work-item-level visibility
- 1 = No cost data exposed; only vendor can report on cost
Integration and Lock-in (Weight: 20%)
- 5 = Vendor-agnostic API, model-agnostic (can swap Claude for GPT), data export in standard format, <2-week switching cost
- 4 = Model-specific but allows data export; vendor switching cost <4 weeks
- 3 = Vendor-specific API; data export available; switching cost 4–8 weeks
- 2 = Significant vendor lock-in; data export limited or lossy; switching cost >8 weeks
- 1 = Complete vendor lock-in; no data export; effectively impossible to switch
Performance and SLA (Weight: 15%)
- 5 = Published SLA with 99.9% uptime, <500ms p99 latency, automatic rollback on model degradation
- 4 = Published SLA with 99.5%+ uptime, <1s p99 latency
- 3 = SLA available on request, 99%+ uptime, <2s latency
- 2 = No published SLA; uptime and latency benchmarks unclear
- 1 = Frequent outages or unacceptable latency; no SLA
Compliance and Data (Weight: 15%)
- 5 = SOC2 Type II, FedRAMP, HIPAA, GDPR compliant, data residency options
- 4 = SOC2 Type II, GDPR compliant, data residency in US and EU
- 3 = SOC2 Type II, general GDPR compliance, US residency
- 2 = SOC2 Type I or limited compliance
- 1 = No compliance certifications; data handling unclear
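A minimal scoring sketch using the weights above (the sample scores are made up):

```python
WEIGHTS = {
    "cost_transparency": 0.25,
    "attribution": 0.25,
    "integration_lock_in": 0.20,
    "performance_sla": 0.15,
    "compliance_data": 0.15,
}

def rubric_score(scores: dict[str, int]) -> float:
    """Weighted sum of 1-5 dimension scores; the maximum is 5.0."""
    assert set(scores) == set(WEIGHTS), "score every dimension exactly once"
    assert all(1 <= s <= 5 for s in scores.values()), "scores are 1-5"
    return sum(WEIGHTS[d] * s for d, s in scores.items())

vendor_a = {
    "cost_transparency": 4,
    "attribution": 5,
    "integration_lock_in": 3,
    "performance_sla": 4,
    "compliance_data": 3,
}
print(round(rubric_score(vendor_a), 2))  # 3.9 out of 5
```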
The POC evaluation checklist
When you run a POC, use this checklist to measure whether the pilot is worth scaling. See "AI POC Evaluation: How to Know If a Pilot Is Worth Scaling" for details, but in brief:
- [ ] Cost-per-outcome baseline defined and agreed before pilot starts
- [ ] Actual cost-per-outcome within 15% of baseline by week 6
- [ ] Escalation rate and quality metrics defined and tracked weekly
- [ ] Attribution data exposed in the vendor's API from day one
- [ ] Integration debt identified and remediated (no major latency surprises in production)
- [ ] Change management plan in place (team trained, workflows updated, incentives aligned)
- [ ] Scaling triggers defined (what volume triggers a refresh of the contract, what cost-per-outcome decline justifies expansion)
- [ ] Vendor willing to discuss cost true-down if actual volume is lower than forecast
If five or more of these boxes are unchecked at week 6, stop the POC and renegotiate. A POC that doesn't have attribution data or cost baselines is not a POC—it's a confidence game.
Contract terms that protect you: MFN, true-down, and audit clauses
Three contract clauses matter more than price:
True-down clauses. Most contracts have true-up clauses: if you use more volume than expected, you pay overages. Insist on true-down clauses: if you use less volume than expected, you get a credit or refund. True-down forces the vendor to right-size their forecast and gives you flexibility to scale back if the agent underperforms; the settlement math is sketched after these three clauses.
MFN clauses. A most-favored-nation clause says: if the vendor gives a lower price to a peer company, you get that price. MFN prevents the vendor from giving your competitor a 30% discount while you pay full freight.
Audit rights. You should have the right to audit the vendor's invoice. Ask for log-level access to confirm: Did the agent really make 50,000 API calls, or was it 40,000? Did the token count match your observability data? A vendor that refuses audit rights is hiding something. If you commit to an NDA that covers the vendor's cost structure, most vendors will allow annual audits.
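The true-up/true-down symmetry is easiest to see in arithmetic. A toy settlement calculation, ignoring the rate tiers, caps, and notice periods a real contract would add:

```python
def settle(committed_volume: int, actual_volume: int, unit_price: float) -> float:
    """Positive result = overage owed (true-up);
    negative result = credit due (true-down)."""
    return (actual_volume - committed_volume) * unit_price

# Committed to 60,000 tickets at $0.45 but only ran 48,000:
print(round(settle(60_000, 48_000, 0.45), 2))  # -5400.0, a $5,400 credit
```

Without the true-down clause, that same shortfall is simply money you paid for volume you never used.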
The post-deployment question every buyer forgets to ask
After six months of deployment, most teams have moved on. The agent is working, the budget is approved, the project is considered "done." This is when you should ask the question that separates cost-transparent vendors from cost-hiding vendors:
Can you prove that this deployment is cheaper than the alternative?
The alternative might be hiring a full-time person ($80K salary + 40% benefits = $112K per year = $9,333 per month). The alternative might be a previous vendor. The alternative might be status quo—you just handle fewer tickets and leave money on the table. A vendor who can point to your cost data and say "At your current volume of 5,000 tickets per month at $0.55 per ticket, you're spending $2,750 monthly, versus $9,333 for a full-time CSR, plus this agent is available 24/7" is a vendor you can trust. A vendor who says "The benchmarks show we're cost-competitive" is hiding cost.
When you're ready to evaluate vendors with rigor, start with the needs assessment. Know your baseline. Write your own RFP. Run a POC with a pre-agreed cost target. And make sure your contract has true-down, MFN, and audit clauses. The goal is not to find the vendor with the lowest API rate—it's to find the vendor with the most cost transparency and the clearest path to proving ROI to your board.
Want to see this in your stack?
Book a 30-minute walkthrough with a Runrate founder.