AI Vendor Evaluation Framework: Cost, Control, and Outcomes

7 min read · Updated 2026-05-02

Evaluating AI vendors without a framework is like buying a house without an inspection: you might get lucky, or you might buy a money pit. Use this framework to score vendors objectively on five dimensions—cost transparency, attribution capability, lock-in risk, performance and reliability, and compliance—then rank them and negotiate.

The five evaluation dimensions

1. Cost Transparency (Weight: 25%)

Cost transparency is the most critical dimension because it predicts your ability to measure ROI and control spending. A vendor that hides costs is hiding their incentive to cut corners.

Scoring rubric:

  • Score 5 (Best in class): Vendor publishes cost-per-outcome benchmarks for your use case. Cost is exposed work-item-by-work-item in API responses. Vendor allows third-party cost audits.
  • Score 4 (Strong): Vendor publishes per-token and per-request pricing with a cost estimate for your use case. Most hidden costs (retries, tool calls, vector storage) are included in the estimate.
  • Score 3 (Acceptable): Vendor publishes per-token and per-request pricing. Some hidden costs are acknowledged. Cost can be extracted from usage reports.
  • Score 2 (Weak): Vendor publishes only headline pricing without breakdown. Hidden costs are not discussed. Cost attribution is limited.
  • Score 1 (Unacceptable): Vendor refuses to quote cost or gives only vague benchmarks like "cost depends on your use case." No transparency.

Questions to ask:

  • What's your cost per resolved work item for a similar customer at our volume?
  • What is the 95th percentile token usage per resolved item?
  • What's included in your per-request cost, and what's charged separately?

What this measures:

A vendor's willingness to expose cost is a proxy for their confidence in their economics. Klarna can say "$0.19 per resolved ticket" because they've tuned their agent to be cheap. A vendor that says "cost depends on your use case" either doesn't know their cost or doesn't want you to know it.
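
If a vendor's usage export includes per-item tokens and cost, you can verify their quoted numbers yourself rather than taking the benchmark on faith. Here's a minimal Python sketch; the field names (work_item_id, tokens, cost_usd, resolved) are hypothetical stand-ins for whatever your vendor's report actually exposes:

```python
# Sanity-check a vendor's quoted cost-per-outcome against your own usage
# export. All field names here are hypothetical; substitute whatever your
# vendor's usage report actually emits.
import statistics

usage_log = [
    {"work_item_id": "t-001", "tokens": 1_200, "cost_usd": 0.012, "resolved": True},
    {"work_item_id": "t-002", "tokens": 9_800, "cost_usd": 0.098, "resolved": True},
    {"work_item_id": "t-003", "tokens": 4_500, "cost_usd": 0.045, "resolved": False},
]

resolved = [r for r in usage_log if r["resolved"]]

# Cost per resolved item: total spend divided by resolved outcomes.
# Unresolved attempts still burn tokens, so they stay in the numerator.
cost_per_resolved = sum(r["cost_usd"] for r in usage_log) / len(resolved)

# 95th percentile token usage per resolved item (the tail drives the bill).
p95_tokens = statistics.quantiles([r["tokens"] for r in resolved], n=100)[94]

print(f"cost per resolved item: ${cost_per_resolved:.3f}")
print(f"p95 tokens per resolved item: {p95_tokens:.0f}")
```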

2. Attribution Capability (Weight: 25%)

Attribution is your ability to tie cost to outcome. Without attribution, you can't measure ROI, and you can't optimize. A vendor that doesn't expose work-item-level cost attribution is selling you a black box.

Scoring rubric:

  • Score 5 (Best in class): Cost data is tagged by work-item ID in real time. API response includes component-level cost breakdown (tokens, retries, tool calls, vector retrieval). Batch reporting by work-item class is available.
  • Score 4 (Strong): Cost data is tagged by work-item class. Daily batch reporting shows cost and volume by category. Work-item-level data is available on request.
  • Score 3 (Acceptable): Cost data is available in a dashboard. Monthly cost and usage reports are available. Work-item-level attribution is partial.
  • Score 2 (Weak): Cost data is aggregated only. You can see total cost per month, but not which work items drove the cost.
  • Score 1 (Unacceptable): No cost data is exposed; only the vendor can report on cost.

Questions to ask:

  • Can you expose cost per work item in your API responses?
  • What attribution metadata is included in each response?
  • Can we export cost logs and join them with our outcome data?

What this measures:

Most vendors have observability dashboards but no work-item-level cost hook. A vendor with first-class attribution is rare and worth paying for. This is what separates a "trust me on the benchmarks" vendor from a "here's your actual cost" vendor.
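
To see what first-class attribution buys you, here is a sketch of the join between an exported vendor cost log and your own outcome data, assuming the vendor tags each record with your work-item ID (the Score 4–5 behavior above). Every field name is hypothetical:

```python
# Join a vendor cost export with your own outcome data to get
# cost-per-resolution by work-item class.
from collections import defaultdict

vendor_costs = {"t-001": 0.02, "t-002": 0.31, "t-003": 0.05}  # work_item_id -> USD
outcomes = {  # from your own systems, keyed by the same work_item_id
    "t-001": {"class": "refund", "resolved": True},
    "t-002": {"class": "billing", "resolved": False},
    "t-003": {"class": "refund", "resolved": True},
}

spend = defaultdict(float)
resolutions = defaultdict(int)
for item_id, cost in vendor_costs.items():
    outcome = outcomes.get(item_id)
    if outcome is None:
        continue  # a cost record you can't attribute is itself a red flag
    spend[outcome["class"]] += cost
    resolutions[outcome["class"]] += outcome["resolved"]

for cls, total in spend.items():
    if resolutions[cls]:
        print(f"{cls}: ${total:.2f} spent, ${total / resolutions[cls]:.2f} per resolution")
    else:
        print(f"{cls}: ${total:.2f} spent, zero resolutions")
```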

3. Lock-In Risk (Weight: 20%)

Lock-in risk is your switching cost if you want to change vendors in year 2. There are five lock-in vectors: model lock-in, data lock-in, prompt lock-in, API lock-in, and contract lock-in. A vendor that mitigates all five is rare; aim for at least three.

Scoring rubric:

  • Score 5 (Best in class): Vendor-agnostic API. Model-agnostic (Claude ↔ GPT swap without rearchitecture). Data export in standard format (JSON, Parquet). <2-week switching cost.
  • Score 4 (Strong): Model-specific but allows data and prompt export. Vendor switching cost <4 weeks.
  • Score 3 (Acceptable): Vendor-specific API; data export available; vendor switching cost 4–8 weeks.
  • Score 2 (Weak): Significant vendor lock-in; data export is lossy or charged separately; switching cost >8 weeks.
  • Score 1 (Unacceptable): Complete lock-in; no data export; impossible or prohibitively expensive to switch.

Questions to ask:

  • If you deprecate a model I depend on, what's my upgrade path?
  • Can we export our conversation history and fine-tuned models?
  • What's your model versioning policy?

What this measures:

Lock-in gets worse over time. Year 1 you're locked in by inertia and switching cost. Year 2 you're locked in by dependence on the vendor's latest model. Year 3 you're locked in by integration depth. A vendor who mitigates lock-in is giving you a genuine choice to stay, not a hostage situation.
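
One mitigation you can build on your own side is a thin provider-agnostic seam, so swapping models is a one-line change instead of a rearchitecture. A minimal sketch; the two provider classes are placeholders, not real SDK calls:

```python
# A provider-agnostic seam that caps model lock-in. App code depends on
# the Protocol, never on a vendor SDK directly.
from typing import Protocol

class ChatProvider(Protocol):
    def complete(self, prompt: str) -> str: ...

class AnthropicProvider:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wrap the Anthropic SDK here")

class OpenAIProvider:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wrap the OpenAI SDK here")

def resolve_ticket(provider: ChatProvider, ticket_text: str) -> str:
    # Swapping vendors changes one constructor call, not this function.
    return provider.complete(f"Resolve this support ticket:\n{ticket_text}")
```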

4. Performance and Reliability (Weight: 15%)

SLAs matter if you're replacing human labor. If your agent is 99.5% available but your support team expects 99.99%, you have a problem.
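
To make that gap concrete, here is the downtime arithmetic as a quick sketch:

```python
# Convert SLA percentages into a monthly downtime budget.
HOURS_PER_MONTH = 730  # ~365.25 days * 24 hours / 12 months

for sla in (0.995, 0.999, 0.9999):
    downtime_min = HOURS_PER_MONTH * (1 - sla) * 60
    print(f"{sla:.2%} uptime -> {downtime_min:.0f} min/month allowed downtime")
# 99.50% -> ~219 min; 99.90% -> ~44 min; 99.99% -> ~4 min
```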

Scoring rubric:

  • Score 5 (Best in class): Published SLA 99.9%+ uptime. p99 latency <500ms. Automatic rollback on model degradation. Multi-region redundancy.
  • Score 4 (Strong): Published SLA 99.5%+ uptime. p99 latency <1s. Incident response SLA <1 hour.
  • Score 3 (Acceptable): SLA available on request. 99%+ uptime. p99 latency <2s.
  • Score 2 (Weak): No published SLA. Uptime and latency unclear. Incident response slow.
  • Score 1 (Unacceptable): Frequent outages or unacceptable latency for your use case.

Questions to ask:

  • What's your published uptime SLA?
  • What's your p99 latency?
  • What happens if you miss the SLA, and what's the customer credit?

What this measures:

Outage costs compound. One outage that takes down your agent for 4 hours on a Friday night costs you customer escalations, manual work, and reputation damage. A vendor's SLA is a proxy for how seriously they take operations.

5. Compliance and Data Handling (Weight: 15%)

Compliance matters if you're in healthcare, financial services, or heavily regulated verticals. Even if you're not regulated, data handling is a proxy for vendor maturity.

Scoring rubric:

  • Score 5 (Best in class): SOC 2 Type II, HIPAA, GDPR, and FedRAMP compliance. Data residency options (US, EU). Annual audits.
  • Score 4 (Strong): SOC 2 Type II and GDPR compliance. US and EU data residency.
  • Score 3 (Acceptable): SOC 2 Type II. GDPR compliance. US residency only.
  • Score 2 (Weak): SOC 2 Type I or limited compliance. Data handling unclear.
  • Score 1 (Unacceptable): No compliance certifications. No transparency on data practices.

Questions to ask:

  • What compliance certifications do you hold?
  • Where are your data centers?
  • Can you provide a Data Processing Agreement (DPA)?

What this measures:

Compliance certifications are expensive to earn and maintain. A vendor with a real SOC 2 Type II report has been through a third-party audit and passed. A vendor that claims compliance but can't produce the report is probably lying.

The scoring spreadsheet

Create a spreadsheet with five columns: dimension, weight, vendor A score, vendor B score, vendor C score. For each dimension, score each vendor 1–5, then multiply by the weight and sum.

| Dimension | Weight | Vendor A | Vendor B | Vendor C |
| --- | --- | --- | --- | --- |
| Cost Transparency | 25% | 4 | 5 | 2 |
| Attribution | 25% | 3 | 5 | 2 |
| Lock-In Risk | 20% | 2 | 4 | 3 |
| Performance | 15% | 4 | 5 | 4 |
| Compliance | 15% | 3 | 4 | 4 |
| Weighted Score | 100% | 3.20 | 4.65 | 2.80 |

In this example, Vendor B is the clear winner (4.65), with Vendor A in second place (3.20) and Vendor C well behind (2.80). The spreadsheet keeps the comparison objective and forces you to justify every score in writing.
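
The same calculation in a few lines of Python, in case you'd rather rerun it with more vendors or different weights than maintain a spreadsheet; the numbers mirror the table above:

```python
# Weighted vendor scoring: score (1-5) x weight, summed per vendor.
WEIGHTS = {"Cost Transparency": 0.25, "Attribution": 0.25,
           "Lock-In Risk": 0.20, "Performance": 0.15, "Compliance": 0.15}

scores = {
    "Vendor A": {"Cost Transparency": 4, "Attribution": 3, "Lock-In Risk": 2,
                 "Performance": 4, "Compliance": 3},
    "Vendor B": {"Cost Transparency": 5, "Attribution": 5, "Lock-In Risk": 4,
                 "Performance": 5, "Compliance": 4},
    "Vendor C": {"Cost Transparency": 2, "Attribution": 2, "Lock-In Risk": 3,
                 "Performance": 4, "Compliance": 4},
}

for vendor, dims in scores.items():
    total = sum(WEIGHTS[d] * s for d, s in dims.items())
    print(f"{vendor}: {total:.2f}")
# Vendor A: 3.20, Vendor B: 4.65, Vendor C: 2.80
```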

Using the framework in practice

First, fill out the framework during RFP scoring. This forces you to read RFP responses carefully and compare objectively.

Second, use the framework to identify vendor strengths and weaknesses before the demo. For example, if Vendor A scored 2 on lock-in, design your demo around understanding their lock-in vectors and negotiating mitigation.

Third, use the framework to anchor contract negotiations. If Vendor B scored high on cost transparency but low on lock-in, negotiate lock-in protections (data export rights, model versioning guarantees) as part of the contract.

Finally, use the framework as your post-deployment scorecard. Six months after deployment, re-score the vendor on how well they've delivered on each dimension. If a vendor promised high cost transparency but your actual cost data arrives aggregated and delayed, they have failed the criteria you bought against, which justifies renegotiation or exit.

For the full buyer's journey, see "How to Buy AI: The Executive's Vendor Evaluation Guide." For specific questions to ask during demos, see "What Questions to Ask in an AI Vendor Demo."

Want to see this in your stack?

Book a 30-minute walkthrough with a Runrate founder.
