AI Cost Optimization: A CFO's Playbook (Not an Engineer's)

5 min read · Updated 2026-05-02

Runrate Framework

The AI Cost Iceberg

Visible API spend (10%) vs hidden inference, storage, observability, retries, human review (90%).

Read the full framework →

When engineers talk about "AI cost optimization," they mean prompt engineering, model selection, and cache tuning. When CFOs talk about it, they should mean margin protection. These are different optimization problems. The engineer's playbook (switch from GPT-4 to GPT-4 Mini, trim the context window) might save 30% on tokens but do almost nothing for your total AI cost, because tokens are 5-10% of the iceberg. The CFO's playbook (renegotiate vendor contracts, set cost SLOs, reallocate expensive work items to cheaper models, optimize headcount) moves the needle. This is the framework most CFOs don't have yet.

The token-optimization trap

CloudZero publishes research on AI optimization. Their most popular recommendations are: use smaller models, implement prompt caching, reduce token counts, batch API calls. These are all token-level optimizations. And they help. A team that switches from GPT-4 to GPT-4 Mini for customer-support classification might save 70% on their token cost for that use case. But if their token cost is $8,000/month and they save $5,600, they've saved $5,600. If their total AI spend is $120,000/month (because the token cost is just the tip of the iceberg, sitting on top of $112,000 of hidden infrastructure and human cost), they've moved the needle from $120k to $114.4k. That's 4.7% optimization. That's real, but that's also not the CFO's problem to solve.
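The trap is easy to see in a few lines. A minimal sketch of the arithmetic, using the illustrative figures above (the function name is ours, not a real tool):

```python
# Sketch of the token-trap arithmetic. All figures are the article's
# illustrative numbers, not real vendor pricing.

def total_savings_pct(token_cost, token_savings_pct, total_spend):
    """Percent of TOTAL AI spend saved by a token-level optimization."""
    saved = token_cost * token_savings_pct
    return saved / total_spend * 100

# $8k/month in tokens, 70% token savings, $120k/month total AI spend
pct = total_savings_pct(8_000, 0.70, 120_000)
print(f"{pct:.1f}% of total spend")  # → 4.7% of total spend
```

A 70% win on the visible layer is a rounding error against the whole iceberg.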

The CFO's question is different: "We spend $120k/month on AI agents. We should spend $100k. What has to change to get there?" The answer isn't usually "optimize tokens." It's one of three things: renegotiate contracts, reallocate work, or cut headcount that the agents are replacing.

The CFO's playbook: Three levers

Lever 1: Vendor negotiation and benchmarking.

The AI market is moving toward outcome-based pricing. OpenAI, Anthropic, and Google publish list prices. But if you're spending $60k+/month with a vendor, you should be negotiating. The negotiation points aren't tokens per second. They're: volume discounts (we process 2M claims per year, will you give us a 15% volume discount?), exclusivity (we commit to using Claude exclusively for inference, price accordingly), and success-based pricing (our agent has to hit cost-per-ticket targets; if you're the expensive vendor, we're shopping).

Compare your cost per unit of work against known benchmarks. Klarna published that their AI support runs at $0.19 per resolved ticket. Sierra charges around $1.50. Intercom Fin is around $0.99. If your support agent is running at $2.10 per ticket and you're using GPT-4, negotiate a better rate or switch to Claude or GPT-4 Mini. If you're already on GPT-4 Mini and still at $2.10, the problem isn't the model—it's the infrastructure or the task design.
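If you track cost per resolved ticket, the benchmark comparison is a one-liner. A hypothetical helper using the published figures quoted above (the function and structure are illustrative, not a real library):

```python
# Published reference points quoted in the text, in $ per resolved ticket.
BENCHMARKS = {
    "Klarna (in-house)": 0.19,
    "Intercom Fin": 0.99,
    "Sierra": 1.50,
}

def benchmark_gap(your_cost_per_ticket):
    """Return (benchmark_name, dollar_gap) for every benchmark you exceed."""
    return [(name, round(your_cost_per_ticket - cost, 2))
            for name, cost in BENCHMARKS.items()
            if your_cost_per_ticket > cost]

print(benchmark_gap(2.10))
# → [('Klarna (in-house)', 1.91), ('Intercom Fin', 1.11), ('Sierra', 0.6)]
```

A $0.60 gap against the most expensive vendor benchmark is your opening number in the negotiation.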

Lever 2: Per-team budget targets and reallocation.

Set a cost-per-unit target for each AI work item. Then allocate work accordingly. Example: You have three models available for claims processing:

  • GPT-4: $8 per claim (most accurate, slowest)
  • Claude Opus: $6.50 per claim (very accurate, medium speed)
  • GPT-4 Mini: $2.80 per claim (good accuracy, fast)

Don't route all claims to GPT-4. Instead, set rules: "Straightforward claims (clear documents, low ambiguity) route to GPT-4 Mini. Medium-complexity claims route to Claude Opus. Complex, high-dollar claims route to GPT-4." With a mix of, say, 66% straightforward, 24% medium-complexity, and 10% complex claims, your average cost per claim drops to about $4.20. Your accuracy stays high because complex cases got GPT-4.
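The routing rule can be sketched in a few lines. The complexity labels and the 66/24/10 mix are illustrative assumptions; a real router would score claims from document quality and dollar exposure:

```python
# Per-claim rates from the list above (illustrative, not real pricing).
RATES = {"gpt-4-mini": 2.80, "claude-opus": 6.50, "gpt-4": 8.00}

def route(claim):
    """Pick a model by claim complexity; complex claims get GPT-4."""
    if claim["complexity"] == "low":
        return "gpt-4-mini"
    if claim["complexity"] == "medium":
        return "claude-opus"
    return "gpt-4"  # complex, high-dollar claims

def blended_cost(claims):
    """Average cost per claim under the routing rule."""
    return sum(RATES[route(c)] for c in claims) / len(claims)

# Illustrative mix: 66% low, 24% medium, 10% high complexity
claims = ([{"complexity": "low"}] * 66
          + [{"complexity": "medium"}] * 24
          + [{"complexity": "high"}] * 10)
print(f"${blended_cost(claims):.2f} per claim")  # → $4.21 per claim
```

The blended number moves with the mix, which is exactly why this is a budgeting lever: shift 10 points of volume from GPT-4 to GPT-4 Mini and the average drops by about $0.50.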

The engineer's approach would be "pick one model and optimize it." The CFO's approach is "pick the right model for the right work item." This is margin management, not token management.

Lever 3: Headcount reallocation.

This is the uncomfortable lever. If your AI agent is replacing 30% of a 15-person manual claims team, you're saving ~$1.5M/year in salary. If the agent costs $200k/year in AI spend, you're netting $1.3M in margin. Now the question is: where does the savings go? If you keep all 15 people and just have them do less work, you've created a 30% capacity cushion. You can either: redeploy them to higher-value work (training new agents, quality assurance, complex escalations); keep the bench strength for peak capacity; or right-size the headcount. Most companies do option 1: redeploy. But some do option 3. The CFO's job is to make that choice conscious, not let it happen by accident.

What renegotiation looks like: A worked example

You're a mid-market insurance company. Last year, your claims processing agent handled 50,000 claims using GPT-4. Your token cost was $28,000. Your total AI spend was $220,000 (API tokens $28k, vector DB $65k, observability $42k, human review overhead $85k). Cost per claim: $4.40.

Benchmark: Klarna's support runs at $0.19/ticket. They're probably using cheaper models for simpler tasks. Sierra charges $1.50/ticket and uses more accurate models. For claims processing, a reasonable benchmark is $2-3/claim for straightforward cases, $4-6 for complex cases. You're averaging $4.40. You're not wildly off, but you have room.

Year 2 optimization plan:

Spend $15k to fine-tune Claude on your claims data. This reduces inference cost because Claude can be more accurate with less context (shorter prompts cost less). Route 30% of claims (the straightforward ones) to Claude + fine-tuning at $2.10/claim. Route 70% to GPT-4 at $4.20/claim. New blended cost: $3.57/claim. Savings: $0.83 * 50,000 = $41,500/year.

Renegotiate with OpenAI. You're committing 50,000 claims/year, 35,000 of them to GPT-4, and you want a 10% volume discount. New rate: $4.20 * 0.90 = $3.78/claim for GPT-4. Savings: $0.42 * 35,000 = $14,700/year.

Optimize the vector database. You're using an old retrieval strategy and storing redundant vectors. Pinecone audit + optimization: 20% reduction. Savings: $65k * 0.20 = $13,000/year.

Total savings: $41.5k + $14.7k + $13k = $69,200/year against $220k baseline spend. That's 31.5% optimization. And you didn't touch a single prompt.
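The plan's arithmetic can be checked in a few lines (note that 70% of the 50,000 claims is 35,000 routed to GPT-4). All figures are the worked example's, not real pricing:

```python
# Year 2 optimization plan, recomputed from the article's figures.
CLAIMS = 50_000
BASELINE_PER_CLAIM = 4.40  # $220k total spend / 50k claims

# Step 1: fine-tuned routing — 30% at $2.10/claim, 70% at $4.20/claim
blended = 0.30 * 2.10 + 0.70 * 4.20                        # $3.57/claim
routing_savings = (BASELINE_PER_CLAIM - blended) * CLAIMS  # ~$41,500

# Step 2: 10% volume discount on the 35,000 GPT-4-routed claims
discount_savings = (4.20 - 4.20 * 0.90) * 0.70 * CLAIMS    # ~$14,700

# Step 3: vector DB optimization — 20% off a $65k bill
vector_savings = 65_000 * 0.20                             # $13,000

total = routing_savings + discount_savings + vector_savings
print(f"${total:,.0f}/yr saved = {total / 220_000:.1%} of baseline")
```

Keeping the model in a spreadsheet or a script like this matters more than the exact numbers: when a vendor rate or routing mix changes, you rerun it and re-rank the levers.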

What a CFO doesn't optimize (and why engineers shouldn't either)

Don't try to optimize human-review overhead or observability cost. Here's why: human review overhead is a real cost (you're paying people to check the agent's work), but it's part of the business model. Reducing it means either accepting more errors (bad) or building more expensive AI to reduce human touch (which usually fails: you end up with more cost, not less). Observability cost is also non-negotiable. If you can't see what your agent is doing or why it's failing, you'll have bigger problems than the observability bill.

Instead, optimize: Vendor contracts (biggest opportunity, 10-20% typical savings), model routing (picking the right tool for the right task, 15-25% savings), and infrastructure efficiency (vector DB tuning, caching, 10-15% savings).

Build a CFO-grade optimization dashboard

Your optimization dashboard should show:

  • Cost per unit of work by agent (broken down by model, by vendor)
  • Actual cost vs. budget for each agent
  • Benchmark comparison (your cost per claim vs. known benchmarks)
  • Savings opportunities ranked by effort and impact
  • Headcount-to-AI-cost mapping (are you actually realizing headcount savings?)

Update it monthly. Review it with your CFO, COO, and head of AI. Use it to make decisions, not just to observe.
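One way to back the dashboard with data: a minimal, hypothetical row schema. Field names and figures are illustrative assumptions, not a real reporting standard:

```python
from dataclasses import dataclass

@dataclass
class AgentCostRow:
    """One dashboard row: an agent's all-in monthly economics."""
    agent: str
    model: str
    vendor: str
    monthly_cost: float      # all-in: tokens + infra + human-review share
    budget: float            # the agreed monthly target
    units_of_work: int       # claims, tickets, etc.
    benchmark_per_unit: float

    @property
    def cost_per_unit(self) -> float:
        return self.monthly_cost / self.units_of_work

    @property
    def budget_variance(self) -> float:
        return self.monthly_cost - self.budget  # positive = over budget

row = AgentCostRow("claims-agent", "gpt-4", "OpenAI",
                   monthly_cost=18_300, budget=17_000,
                   units_of_work=4_200, benchmark_per_unit=4.00)
print(f"${row.cost_per_unit:.2f}/claim, variance ${row.budget_variance:,.0f}")
```

One row per agent, refreshed monthly, is enough to power every bullet above: per-unit cost, budget variance, and the gap to benchmark all fall out of the same record.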

Explore the full FinOps for AI framework in the pillar article.

