Free Tool

Token Cost Calculator (Multi-Model)

15 min read · Updated 2026-05-02

Start from a workload

Workload parameters

Token shape and monthly volume — the inputs that make models directly comparable.

Cost levers

The two optimizations most teams leave on the table.

Cache hits bill at ~10% of normal input price. Volume discounts apply to both input and output. Effective prices and savings are shown per-model on the right.

Models to compare

Pick up to six, across Anthropic, OpenAI, Google, Mistral, DeepSeek, and Cohere. The snapshot below compares three: Gemini 2.5 Pro, Claude Sonnet 4, and GPT-4o.

  • Cheapest: Gemini 2.5 Pro — $241/mo
  • Spread: 238% ($574/mo from cheapest to priciest)
  • Annual cheapest: $2,895 — switching to Gemini 2.5 Pro saves $6,885/yr vs the priciest model

Cost per query, ranked

All optimizations applied · 50,000 queries / month

  • Gemini 2.5 Pro (cheapest): $0.0048/query · $241/month · $2,895/year
  • Claude Sonnet 4: $0.013/query · $669/month · $8,028/year
  • GPT-4o: $0.016/query · $815/month · $9,780/year

Per-model detail

Effective prices reflect cache hit rate and volume discount.

| Model | Eff. in / out per 1M | Per query | Monthly | Cache savings | Notes |
|-------|----------------------|----------:|--------:|--------------:|-------|
| Gemini 2.5 Pro | $0.91 / $5.00 | $0.0048 | $241 | −$34 | Long-context, strong cost/perf |
| Claude Sonnet 4 | $2.19 / $15.00 | $0.013 | $669 | −$81 | Default for production |
| GPT-4o | $3.65 / $15.00 | $0.016 | $815 | −$135 | General-purpose |

Migration insight

Switching from GPT-4o to Gemini 2.5 Pro saves $574/mo ($6,885/yr) at this volume.

Run a real eval before migrating — the cheapest model isn't always the most accurate. The migration math only matters if both models hit your accuracy target on a representative sample.

Reminder

Per-token spend is the visible tip of the iceberg — typically 5–15% of the all-in cost of a production agent. To size review overhead, integration tax, and infrastructure, use the AI Agent Cost Calculator.

The Token Cost Calculator is the simplest tool with the highest ROI in the Runrate suite: input your average input/output token counts and monthly query volume, and see the per-call cost, monthly spend, and annual cost across every major LLM at current pricing. Use it to compare models head-to-head, evaluate model migrations, justify procurement decisions, and benchmark your AI cost against industry norms — before signing a vendor contract or scaling an agent deployment.

Why per-token pricing is the wrong unit for budgeting (but the right unit for comparison)

A common mistake in 2024–2025 AI procurement was using per-token pricing as a proxy for total cost of ownership. It isn't. Per-token cost is roughly 5–15% of the loaded cost of running an AI agent in production — the rest is integration tax, vector database, observability, retry overhead, and human review. The AI Cost Iceberg framework explains why the visible per-token bill is just the tip.

That said, per-token cost is the right unit for one specific job: comparing models head-to-head with everything else held constant. If you're deciding between Claude Sonnet and GPT-4o for the same workflow, with the same prompts, the same retrieval architecture, the same observability stack, and the same review rate, the only thing that changes is the per-token cost. Get that comparison right and you save tens of thousands of dollars a year. Get it wrong and you're locked into a model whose pricing curve doesn't match your scaling pattern.

This calculator does the comparison. The deeper question of total loaded cost belongs in the AI Agent Cost Calculator (the sister tool); this one is laser-focused on the model layer.

Current pricing (May 2026)

Pricing changes quarterly. The table below reflects current public list pricing as of May 2026; always verify with each vendor's pricing page before making procurement decisions.

| Model | Provider | Input ($/1M tokens) | Output ($/1M tokens) | Context window | Best for |
|-------|----------|--------------------:|---------------------:|----------------|----------|
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | 200K | Highest-accuracy reasoning, complex agentic workflows |
| Claude Sonnet 4 | Anthropic | $3.00 | $15.00 | 200K | Default for most production agents |
| Claude Haiku 4 | Anthropic | $0.80 | $4.00 | 200K | High-volume, lower-complexity tasks |
| GPT-5 | OpenAI | $15.00 | $60.00 | 256K | Frontier reasoning, latest flagship |
| GPT-4o | OpenAI | $5.00 | $15.00 | 128K | General-purpose; strong multimodal |
| GPT-4o mini | OpenAI | $0.15 | $0.60 | 128K | Routing, classification, simple tasks |
| Gemini 2.5 Pro | Google | $1.25 | $5.00 | 1M | Long-context, reasoning, competitive pricing |
| Gemini 2.5 Flash | Google | $0.075 | $0.30 | 1M | Ultra-cheap; high-volume, simple tasks |
| Mistral Large 2 | Mistral | $2.00 | $6.00 | 128K | European data residency, balanced cost |
| Mistral Small 3 | Mistral | $0.20 | $0.60 | 32K | Lightweight, cheap, EU-hosted |
| LLaMA 3.3 70B (self-hosted) | Meta | n/a | n/a | 128K | Self-hosting at scale; infrastructure cost only |
| LLaMA 3.3 8B (self-hosted) | Meta | n/a | n/a | 128K | Edge / low-latency self-hosting |
| DeepSeek V3 | DeepSeek | $0.27 | $1.10 | 128K | Aggressive pricing; reasoning-strong |
| Cohere Command R+ | Cohere | $2.50 | $10.00 | 128K | Enterprise RAG focus |

Pricing notes:

  • Volume discounts. Enterprise contracts typically negotiate 10–30% off list at $200K+ annual commit. Anthropic and OpenAI both offer reserved capacity tiers. Use list pricing for back-of-envelope; replace with your contract pricing for actual budgeting.
  • Cached input pricing. Anthropic and OpenAI both offer cached-input pricing (60–90% discount on input tokens that hit cache). For RAG workloads with stable system prompts, effective input cost can be 30–60% of list. The calculator below treats this as a separate scenario.
  • Batch processing. Both OpenAI and Anthropic offer batch API discounts (~50% off) for non-real-time jobs. Useful for offline workloads (overnight document processing, large-scale evaluations).
  • Output tokens are 3–5x more expensive than input tokens at every major vendor. This is the single most important fact in token cost optimization — the cheapest path to cost reduction is almost always reducing output token count, not input.
  • Self-hosted models (LLaMA, Mistral Small) have $0 marginal token cost but real infrastructure cost. The breakeven analysis is in a section below.

How the calculator works

Inputs:

  • Model selection (multi-select for side-by-side comparison): pick 2–6 models to compare
  • Average input tokens per query: e.g., 2,500
  • Average output tokens per query: e.g., 600
  • Total queries per month: e.g., 50,000
  • Cached input ratio (optional, 0–100%): e.g., 40% (RAG with stable system prompt)
  • Volume discount (optional, 0–30%): e.g., 15% (your negotiated contract rate)

Calculation:

effective_input_price = input_price × (1 − volume_discount) × (1 − cached_ratio × 0.9)
effective_output_price = output_price × (1 − volume_discount)

cost_per_query =
  (input_tokens × effective_input_price ÷ 1,000,000) +
  (output_tokens × effective_output_price ÷ 1,000,000)

monthly_cost = cost_per_query × queries_per_month
annual_cost = monthly_cost × 12
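
A minimal sketch of these formulas in Python — the model name, list prices, and workload numbers below are illustrative inputs, not a live pricing feed:

```python
# Minimal sketch of the calculator's core math — not the production tool.
from dataclasses import dataclass

@dataclass
class ModelPrice:
    name: str
    input_per_1m: float    # list price, $ per 1M input tokens
    output_per_1m: float   # list price, $ per 1M output tokens

def token_cost(price: ModelPrice, input_tokens: int, output_tokens: int,
               queries_per_month: int, cached_ratio: float = 0.0,
               volume_discount: float = 0.0) -> dict:
    # Cached input tokens bill at ~10% of list, i.e. a 90% discount on the cached share.
    eff_in = price.input_per_1m * (1 - volume_discount) * (1 - cached_ratio * 0.9)
    eff_out = price.output_per_1m * (1 - volume_discount)
    per_query = (input_tokens * eff_in + output_tokens * eff_out) / 1_000_000
    monthly = per_query * queries_per_month
    return {"per_query": per_query, "monthly": monthly, "annual": monthly * 12}

# Base case from the customer-service example: 2,000 in / 600 out, 50K queries/mo, 30% cache.
sonnet = ModelPrice("Claude Sonnet 4", 3.00, 15.00)
result = token_cost(sonnet, 2_000, 600, 50_000, cached_ratio=0.30)
print(result)  # eff. input ≈ $2.19/1M → ≈ $0.013/query, ≈ $669/month
```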

Outputs:

  • Cost per query, by model
  • Monthly cost, by model
  • Annual cost, by model
  • Cost delta between cheapest and most expensive (% and absolute)
  • Cost-per-1K-queries (for normalization across volumes)
  • Sensitivity to ±20% in input tokens and ±20% in output tokens

Worked example 1: Customer service agent at mid-market scale

Inputs:

  • Input tokens per ticket: 2,000 (customer message + system prompt + retrieved context)
  • Output tokens per ticket: 600 (response + reasoning trace)
  • Queries per month: 50,000 (mid-sized contact center)
  • Cached input ratio: 30% (system prompt and FAQ context cached)
  • Volume discount: 0% (list pricing baseline)

Cost per query (effective input price applied):

| Model | Input cost/query | Output cost/query | Total/query | Notes |
|-------|-----------------:|------------------:|------------:|-------|
| Claude Sonnet 4 | $0.0050 | $0.0090 | $0.0140 | Default recommendation |
| Claude Haiku 4 | $0.0013 | $0.0024 | $0.0037 | If accuracy holds, biggest win |
| GPT-4o | $0.0083 | $0.0090 | $0.0173 | Marginally pricier than Sonnet |
| GPT-4o mini | $0.0003 | $0.0004 | $0.0007 | Watch for accuracy collapse |
| Gemini 2.5 Pro | $0.0021 | $0.0030 | $0.0051 | Strong cost/perf option |
| Gemini 2.5 Flash | $0.0001 | $0.0002 | $0.0003 | Ultra-cheap; routing only |
| DeepSeek V3 | $0.0005 | $0.0007 | $0.0011 | Aggressive pricing; eval carefully |

Monthly cost (50,000 tickets):

| Model | Monthly | Annual |
|-------|--------:|-------:|
| Claude Sonnet 4 | $700 | $8,400 |
| Claude Haiku 4 | $185 | $2,220 |
| GPT-4o | $865 | $10,380 |
| GPT-4o mini | $35 | $420 |
| Gemini 2.5 Pro | $255 | $3,060 |
| Gemini 2.5 Flash | $15 | $180 |
| DeepSeek V3 | $55 | $660 |

Read the spread. From Gemini Flash ($180/year) to GPT-4o ($10,380/year), the variation is 58x at identical token volumes. But raw cost isn't the decision — accuracy on your specific tickets is. GPT-4o mini and Gemini Flash are calibrated for routing and classification; they will not match Claude Sonnet's accuracy on full ticket resolution. The honest decision framework: if Sonnet hits 92% first-contact resolution on your tickets and Flash hits 76%, the 16-point accuracy gap is worth far more than the $8,000/year savings. Run a real eval before committing to the cheapest model.

Worked example 2: Complex claims adjudication

Inputs:

  • Input tokens per claim: 8,000 (longer documents, policy excerpts, prior history)
  • Output tokens per claim: 1,500 (detailed adjudication reasoning + audit trail)
  • Queries per month: 8,000 claims
  • Cached input ratio: 50% (policy docs and procedure manuals stable)
  • Volume discount: 10% (enterprise contract)

Cost per query (effective input price applied):

| Model | Input cost/query | Output cost/query | Total/query |
|-------|-----------------:|------------------:|------------:|
| Claude Opus 4 | $0.0729 | $0.1013 | $0.1742 |
| Claude Sonnet 4 | $0.0146 | $0.0203 | $0.0349 |
| GPT-5 | $0.0729 | $0.0810 | $0.1539 |
| GPT-4o | $0.0243 | $0.0203 | $0.0446 |
| Gemini 2.5 Pro | $0.0061 | $0.0068 | $0.0129 |

Monthly cost (8,000 claims):

| Model | Monthly | Annual |
|-------|--------:|-------:|
| Claude Opus 4 | $1,394 | $16,728 |
| Claude Sonnet 4 | $279 | $3,348 |
| GPT-5 | $1,231 | $14,772 |
| GPT-4o | $357 | $4,284 |
| Gemini 2.5 Pro | $103 | $1,236 |

Read the spread. The decision in claims adjudication is rarely about saving the $13K/year between Opus and Sonnet; it's about whether the marginal accuracy of the frontier model justifies the cost. If Opus correctly adjudicates 96% of claims and Sonnet correctly adjudicates 93%, that 3-point gap on 8,000 monthly claims is 240 wrong adjudications/month. At $300 average remediation cost per wrong denial (regulatory exposure, member complaint, manual reversal), that's $72,000/month — vastly more than the $1,115/month delta in API cost. The frontier model wins on these economics.

The opposite case: if Sonnet hits 94.5% and Opus hits 95%, and the errors the extra half-point would prevent are low-stakes ones with little remediation cost, the value of that lift can fall below the $1,115/month premium — and the Opus upgrade isn't justified. The eval data, not the price sheet, makes the decision.

Worked example 3: AP invoice automation

Inputs:

  • Input tokens per invoice: 1,500 (invoice content + GL chart + vendor history)
  • Output tokens per invoice: 200 (coding decision + confidence + audit note)
  • Queries per month: 200,000 (large enterprise AP volume)
  • Cached input ratio: 60% (GL chart and vendor master highly cacheable)
  • Volume discount: 20%

Cost per query:

| Model | Total/query | Monthly | Annual |
|-------|------------:|--------:|-------:|
| Claude Sonnet 4 | $0.0034 | $683 | $8,196 |
| Claude Haiku 4 | $0.0009 | $182 | $2,184 |
| GPT-4o mini | $0.0001 | $25 | $300 |
| Gemini 2.5 Flash | $0.00006 | $12 | $144 |

Read the spread. AP invoice coding is exactly the kind of high-volume, low-complexity workload where the smaller models earn their keep. If Haiku hits 96% accuracy (and humans review the 4% it flags as low confidence), the $6,012/year savings vs Sonnet is real money — and there's no reason to pay for Sonnet's reasoning on what is essentially a classification problem. Run the eval, confirm Haiku holds up, and pocket the savings.

Model migration ROI: when to switch

A common scenario: you deployed an agent on an older or pricier model and a newer/cheaper model has shipped. Use the calculator to size the migration ROI.

Example: GPT-4 Turbo → GPT-4o migration

Old model (GPT-4 Turbo at deprecated pricing): input $10/1M, output $30/1M
New model (GPT-4o): input $5/1M, output $15/1M

For an agent doing 100,000 queries/month at 2,000 input + 600 output tokens:

| Metric | GPT-4 Turbo | GPT-4o | Delta |
|--------|------------:|-------:|------:|
| Cost/query | $0.0380 | $0.0190 | −50% |
| Monthly cost | $3,800 | $1,900 | −$1,900 |
| Annual cost | $45,600 | $22,800 | −$22,800 |

Migration cost (testing, eval, integration): typically $15K–$40K depending on complexity. Against the $1,900/month savings above, payback runs roughly 8–21 months at this volume, and drops to a few months at higher volumes. For most teams running on GPT-4 Turbo, GPT-4o was a no-brainer migration in 2024.
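
A small sketch of the payback arithmetic, using the per-query costs above; the $25K one-time migration cost is an assumed, illustrative figure:

```python
# Sketch: payback period for a model migration (GPT-4 Turbo → GPT-4o numbers above).
def migration_payback_months(old_per_query: float, new_per_query: float,
                             queries_per_month: int, migration_cost: float) -> float:
    monthly_savings = (old_per_query - new_per_query) * queries_per_month
    return migration_cost / monthly_savings if monthly_savings > 0 else float("inf")

# 100K queries/month, $0.038 → $0.019 per query, $25K one-time migration cost.
print(migration_payback_months(0.038, 0.019, 100_000, 25_000))  # ≈ 13 months
```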

Example: Claude Sonnet 3.5 → Claude Sonnet 4

If pricing held (it did at $3/$15) but accuracy improved 2 points and latency dropped 30%, the migration is essentially free upside. Migration cost is small (same API, same model family); benefit is real (faster, more accurate, lower retry rate). These free-upgrade migrations should happen automatically; the calculator helps you confirm there's no pricing regression.

Example: Single-model → multi-model routing

A more sophisticated migration: replace a single Claude Sonnet deployment with a router that sends simple queries to Haiku and complex queries to Sonnet. If 60% of queries are simple:

| Setup | Cost/query (blended) | Monthly (50K queries) |
|-------|---------------------:|----------------------:|
| All Sonnet | $0.0140 | $700 |
| Router: 60% Haiku, 40% Sonnet | $0.0078 | $390 |

Annual gross savings: $3,720. Router infrastructure cost: typically $3K–$8K to build, $200–$500/month to operate — which at 50,000 queries/month eats most of the savings (net of a $200/month opex, payback on a $3K build runs roughly two years). The economics improve sharply with volume: at 200,000 queries/month the same 60/40 split saves roughly $1,240/month gross and pays back in a few months. The routing pattern is increasingly the default for production AI agents in 2026, but run the numbers at your own volume before building the router.
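
A sketch of the blended-cost and payback math; the per-query costs come from worked example 1, and the $3K build cost and $200/month operating cost are assumptions, not quotes:

```python
# Sketch: blended per-query cost for a two-tier router, plus payback on the router itself.
def blended_cost(simple_share: float, cheap_per_query: float, strong_per_query: float) -> float:
    return simple_share * cheap_per_query + (1 - simple_share) * strong_per_query

all_sonnet = 0.0140
routed = blended_cost(0.60, 0.0037, 0.0140)        # ≈ $0.0078/query
gross_savings = (all_sonnet - routed) * 50_000     # ≈ $310/month at 50K queries
net_savings = gross_savings - 200                  # assume $200/month router opex
print(routed, gross_savings, 3_000 / net_savings)  # build payback ≈ 27 months at this volume
```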

Self-hosted vs API: when infrastructure cost beats per-token cost

The calculator includes a self-hosted scenario for LLaMA 3.3 70B and Mistral Small 3. Self-hosting has $0 marginal per-token cost, but real GPU infrastructure cost.

Example: LLaMA 3.3 70B on a 4×H100 cluster

  • Infrastructure: 4×H100 GPUs (cloud-rented at ~$3.50/GPU-hour) = $14/hour = ~$120,000/year (24/7, before any reserved-capacity discount)
  • Throughput: ~80 queries per second sustained on a tuned setup
  • Annual capacity: 80 × 86,400 × 365 × 0.7 utilization = ~1.77B queries/year

Compared to Claude Sonnet 4 at $0.014/query:

Breakeven volume: $120,000 ÷ $0.014/query = ~8.6M queries/year (~715K queries/month).

Below 715K queries/month, Claude Sonnet API is cheaper than self-hosted LLaMA. Above 715K queries/month, self-hosted starts winning, with the gap widening as volume scales. At 5M queries/month, self-hosted is roughly $120K/year vs $840K/year for Claude — a $720K/year savings.
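
The same breakeven arithmetic as a short sketch, using the GPU rate and per-query API cost stated above:

```python
# Sketch: self-hosting breakeven volume vs an API model.
annual_infra = 4 * 3.50 * 24 * 365   # 4×H100 at ~$3.50/GPU-hour ≈ $122,640/year
api_per_query = 0.014                # Claude Sonnet 4 at this token shape

breakeven_per_year = annual_infra / api_per_query  # ≈ 8.8M queries/year
breakeven_per_month = breakeven_per_year / 12      # ≈ 730K/month (≈ 715K with the rounded $120K figure)
print(round(breakeven_per_month))
```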

Caveats that the simple breakeven misses:

  1. Engineering cost. Running a production LLaMA cluster requires 1–2 ML engineers ($300K–$500K/year loaded). For most companies, this dwarfs the API savings until volume is genuinely huge.
  2. Reliability and SLA. API providers handle uptime, scaling, model updates. Self-hosted requires you to handle all of that. The hidden labor cost is significant.
  3. Accuracy gap. LLaMA 3.3 70B is closer to Claude Sonnet/GPT-4o than ever, but still typically 2–5 points behind on agentic benchmarks. The accuracy gap may cost more than the infrastructure savings.
  4. Quantization and optimization tradeoffs. Self-hosted often runs quantized weights (FP8, INT4) for throughput; that quantization costs accuracy. Production deployments need to measure this.

The honest decision framework:

  • <200K queries/month: API. Self-hosting doesn't make sense at this scale.
  • 200K–700K queries/month: API, unless you have ML platform engineers already on staff.
  • 700K–2M queries/month: Evaluate both seriously. Run a parallel deployment and measure cost, accuracy, and operational overhead.
  • >2M queries/month: Self-hosting becomes structurally attractive. Many high-volume B2C deployments live here.

Cached input pricing: the largest cost optimization most teams miss

Both Anthropic and OpenAI offer cached input pricing — when the same prompt prefix is sent in subsequent calls, the cached portion is billed at 10% of normal input price. For RAG workflows, system-prompt-heavy agents, and any workload with stable context, this changes the math materially.

Example: RAG agent with 5,000-token stable context

  • Input tokens per query: 5,000 stable + 500 dynamic = 5,500 total
  • Output tokens per query: 800
  • Volume: 100,000 queries/month
  • Model: Claude Sonnet 4

Without cache:

  • Input cost/query: 5,500 × $3/1M = $0.0165
  • Output cost/query: 800 × $15/1M = $0.0120
  • Total: $0.0285/query × 100,000 = $2,850/month

With cache (5,000 tokens hit cache, 500 fresh, 90% input discount on cached):

  • Cached input cost: 5,000 × $0.30/1M = $0.0015
  • Fresh input cost: 500 × $3/1M = $0.0015
  • Output cost: $0.0120
  • Total: $0.015/query × 100,000 = $1,500/month

Savings: $1,350/month, $16,200/year, 47% reduction.
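
A short sketch of the cache math, assuming Claude Sonnet 4 list pricing and cached input billed at roughly 10% of list:

```python
# Sketch: cached-input savings for the RAG example above.
IN_PRICE, OUT_PRICE, CACHED_IN_PRICE = 3.00, 15.00, 0.30   # $ per 1M tokens

def monthly_cost(stable_in: int, dynamic_in: int, out: int, queries: int, cached: bool) -> float:
    stable_price = CACHED_IN_PRICE if cached else IN_PRICE
    per_query = (stable_in * stable_price + dynamic_in * IN_PRICE + out * OUT_PRICE) / 1_000_000
    return per_query * queries

no_cache = monthly_cost(5_000, 500, 800, 100_000, cached=False)   # ≈ $2,850/month
with_cache = monthly_cost(5_000, 500, 800, 100_000, cached=True)  # ≈ $1,500/month
print(no_cache - with_cache)                                      # savings ≈ $1,350/month
```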

For most production RAG workloads, cache hit rates of 40–80% are achievable with moderate engineering effort. The infrastructure cost to enable caching is essentially zero — it's a flag in the API call. Most teams that haven't enabled it are leaving 30–50% of their token budget on the table.

Procurement playbook: how to use this calculator

When evaluating AI vendors or deploying a new agent, run these five comparisons in the calculator:

1. Standardize on a base case. For a customer service agent: 2,000 input + 600 output tokens, 50,000 queries/month. For a claims agent: 8,000 input + 1,500 output, 8,000 queries/month. For an AP agent: 1,500 input + 200 output, 200,000 queries/month. This forces apples-to-apples comparison across vendors who might quote prices differently.

2. Calculate the cost delta. If Model A is $0.014/query and Model B is $0.019/query, that's a 36% price premium. Is the accuracy gap worth 36%? Run the eval, get a number, decide.

3. Test sensitivity to your actual token count. Vendors quote based on assumed token counts. Your actual workload may use 30% more or 30% fewer tokens than the standard case. Recalculate with your real numbers — the rank order can change.

4. Model the cached-input scenario. If you're running RAG with stable context, enable cache pricing in the calculator. Compare both cache-on and cache-off — the cache-on numbers are usually what you'll actually pay if your team is competent.

5. Plan for model deprecation and migration. Every model has a successor on a 12–18-month cadence. When you sign a contract, calculate the migration ROI to the next model that's likely to ship. If the migration math is favorable, build the eval and migration capacity into your deployment plan from day one.

Common procurement traps the calculator surfaces

Three traps that the calculator catches before you sign:

Trap 1: Output-heavy workloads with reasoning models. Reasoning models (Claude Opus, GPT-5, DeepSeek R1) emit far more output tokens than non-reasoning models because the reasoning trace counts as output. A "simple" $0.05/query estimate can balloon to $0.30/query if you don't account for the reasoning-trace overhead. The calculator forces you to enter realistic output token counts; pad for reasoning if you're using a thinking model.

Trap 2: Long-context pricing surprises. Some models price differently above certain context thresholds (e.g., Gemini 1.5 Pro had different pricing above 128K input tokens). If your workload occasionally hits these thresholds, the average cost-per-query can be materially higher than the headline number.

Trap 3: Multi-step agentic workflows undercounted. A "single query" in agentic workflows can fan out to 5–15 underlying LLM calls (planner, retriever, reasoner, validator, output formatter). The headline per-query cost should be multiplied by the agentic call count to get realistic spend. The calculator has an optional "agentic multiplier" field for exactly this reason.
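
A trivial sketch of the multiplier; the 8-call fan-out is an illustrative assumption — measure yours from production traces:

```python
# Sketch: applying an agentic fan-out multiplier to the headline per-query cost.
per_llm_call = 0.0140          # headline per-call cost from the calculator
calls_per_user_query = 8       # planner + retrieval + reasoning + validation + formatting
print(per_llm_call * calls_per_user_query)   # ≈ $0.11 per user-facing query, not $0.014
```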

Limitations and caveats

The calculator assumes:

  • Pricing is current as of May 2026. Vendors change pricing quarterly. Always cross-check against the live pricing page for the model you're considering before signing a contract or building a long-term forecast.
  • List pricing applies. Enterprise customers negotiate 10–30% off list. If you have a contract, override the volume discount field.
  • No batch processing. The calculator assumes real-time, on-demand inference. Batch API discounts (~50% off at OpenAI and Anthropic) apply only to non-real-time workloads — useful for large overnight runs, not customer-facing agents.
  • No latency consideration. Cheaper models often have higher latency. If you're building a real-time customer experience, cost-per-query alone misses the question of whether the experience is acceptable.
  • No accuracy adjustment. The calculator treats per-query cost as the only variable. In production, accuracy differences across models translate into review-rate differences, retry-rate differences, and failure-cost differences that often dwarf the API cost gap.

The calculator is the right tool for one job: comparing the model layer of cost across vendors. For total loaded cost — including all five layers of the AI Cost Iceberg — use the AI Agent Cost Calculator. For ROI sizing against a human baseline, use the AI ROI Calculator. For maturity benchmarking, use the AI Cost Maturity Self-Assessment.

What to do next

Open the calculator. Input your actual workload (check your API logs for real token counts; don't guess). Compare 3–5 models head-to-head using your real volume. Identify the cheapest viable model and the most-accurate model; the right answer is usually somewhere between them, picked based on a real eval.

Then look at the gap between API cost and your actual AI spend. The calculator's number is a floor — your real spend, including infrastructure, integration, and review, is 5–10x higher. If you don't have visibility into the loaded number, you're operating on the visible tip of the iceberg.

When you're ready to see what work-item-level AI cost attribution looks like for your stack — including the iceberg layers that don't show up in the API bill — Runrate's 30-minute walkthrough is the next step.

Want to see this in your stack?

Book a 30-minute walkthrough with a Runrate founder.

Get a Demo
