[Figure: The AI Cost Iceberg — visible API spend (10%) vs. hidden inference, storage, observability, retries, and human review costs (90%).]
You budget for AI based on token usage. You forecast a billion tokens per month at $3 per million tokens — $3,000. You should be safe.
Then your token usage doubles to 2 billion. Your token bill doubles to $6,000. Fine. But your infrastructure bill, your latency costs, and your error-handling costs have tripled. Your total AI bill grew nonlinearly.
This is the inference cost problem. Token cost is linear. Everything else about running inference at scale is not.
What Inference Actually Costs
Inference is the act of running the model — taking your input (the prompt) and generating output (the response). It happens every single time someone uses your AI.
There are two ways to pay for inference:
Option 1: Pay the vendor per token (API pricing)
You send your request to OpenAI or Anthropic. They run it on their infrastructure. You pay $3 per million input tokens, $15 per million output tokens. You do not see or manage the infrastructure.
Option 2: Run your own model on your own infrastructure (self-hosted)
You rent GPUs from AWS, Google Cloud, or a specialist provider like Together.ai or Lambda Labs. You run the model yourself. You pay for compute time, not for tokens.
For self-hosted inference, a single GPU (such as an NVIDIA A100) costs $2-$4 per hour on cloud. An inference cluster sized for enterprise load might be 8-16 GPUs — $16-$64 per hour. Running 24/7, that is roughly $140k-$560k per year, or about $12k-$47k per month.
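As a back-of-the-envelope check — GPU counts and hourly rates are the figures above, and 730 hours/month is an assumption of continuous operation — the arithmetic looks like this:

```python
# Back-of-the-envelope cluster cost. Rates and GPU counts are the
# illustrative figures from the text, not quotes from any provider.
HOURS_PER_MONTH = 730  # 24/7 operation, averaged over a year

def cluster_cost(gpus, rate_per_gpu_hour):
    """Hourly, monthly, and yearly cost of a cluster running around the clock."""
    hourly = gpus * rate_per_gpu_hour
    monthly = hourly * HOURS_PER_MONTH
    return hourly, monthly, monthly * 12

for gpus, rate in [(8, 2.0), (16, 4.0)]:  # low and high end of the range
    hourly, monthly, yearly = cluster_cost(gpus, rate)
    print(f"{gpus} GPUs @ ${rate}/hr -> ${hourly:.0f}/hr, "
          f"${monthly:,.0f}/mo, ${yearly:,.0f}/yr")
# 8 GPUs @ $2.0/hr -> $16/hr, $11,680/mo, $140,160/yr
# 16 GPUs @ $4.0/hr -> $64/hr, $46,720/mo, $560,640/yr
```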
But there are hidden costs on both paths.
The API Path: Hidden Costs Beyond Tokens
When you use OpenAI or Anthropic via API, you see the token bill. You do not see:
Latency costs. If your customer-support AI takes 5 seconds to respond (because it is hitting the OpenAI API over the network), your customer satisfaction score falls. You lose revenue from lower conversion, higher churn, or lower NPS. This is a hidden cost that does not show up on the API bill.
To reduce latency, you either pay for premium API access (for example, a priority processing tier), move to self-hosted (high capital cost), or accept the latency hit (revenue loss). There is no free option.
Retry costs. When an API call fails (timeout, rate limit, server error), you retry. Each retry is a billable token event. If 5% of your requests fail once and retry, you just added 5% to your token bill invisibly.
CloudZero found that retry costs add 3-8% to token budgets in production AI systems. This is because most companies do not instrument retry rates or error budgets.
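A minimal sketch of what that instrumentation can look like, assuming a hypothetical call_model() client that raises on timeouts and rate limits (the function name and counters are illustrative, not any vendor's SDK):

```python
import random
import time

# Counters for retry overhead: every retry re-bills the same prompt tokens.
counters = {"prompt_tokens": 0, "retry_tokens": 0}

def call_with_retries(call_model, prompt, prompt_tokens, max_attempts=3):
    """Call the model with exponential backoff, counting re-billed tokens."""
    last_err = None
    for attempt in range(max_attempts):
        counters["prompt_tokens"] += prompt_tokens
        if attempt > 0:
            counters["retry_tokens"] += prompt_tokens  # billed a second time
        try:
            return call_model(prompt)
        except Exception as err:  # real code should catch the SDK's
            last_err = err        # specific timeout/rate-limit errors
            time.sleep(2 ** attempt + random.random())  # backoff with jitter
    raise last_err

# Retry overhead as a share of the token bill -- the 3-8% cited above:
# overhead = counters["retry_tokens"] / counters["prompt_tokens"]
```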
Observability costs. You need to monitor which requests are slow, which are failing, which are expensive. Tools like Helicone, Langfuse, or LangSmith cost $300-$5,000/month depending on volume. At scale, observability can be 5-10% of token cost.
Vector database and caching. If your AI does retrieval-augmented generation (RAG) — searching past data to provide context — you need a vector database. Pinecone, Weaviate, or Milvus cost $500-$20,000/month depending on index size and query volume. This scales with your usage, not with your token spend.
Example: A financial services company with 50,000 customer interactions per month using Claude + Pinecone:
- Tokens: $50,000
- Pinecone: $5,000
- Observability (Langfuse): $1,000
- Rate-limit management and retry tooling: $500
- Total: $56,500
Tokens are 88% of the total, but the other 12% is real money that never appears in a token-based forecast.
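Expressed per interaction, using the figures from the example above:

```python
# Monthly spend from the example above, in dollars.
monthly = {"tokens": 50_000, "vector_db": 5_000,
           "observability": 1_000, "retry_tooling": 500}
interactions = 50_000

total = sum(monthly.values())           # $56,500
hidden = total - monthly["tokens"]      # $6,500 never on the API invoice
print(f"blended cost per interaction: ${total / interactions:.2f}")  # $1.13
print(f"token share: {monthly['tokens'] / total:.0%}")               # 88%
print(f"hidden share: {hidden / total:.0%}")                         # 12%
```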
The Self-Hosted Path: Nonlinear Infrastructure Costs
If you run your own model on GPUs, token costs disappear. But infrastructure costs explode in a nonlinear way.
Scenario: You run a Llama 2 70B model for claims adjudication.
- Baseline: One A100 GPU costs $2/hour = $1,460/month (24/7). It can handle roughly 1,400 inferences per hour, or about 1M per month. Cost: $1.46 per 1,000 inferences.
- 10x scale: You need 10 GPUs to handle 10M inferences/month. Cost: $14,600/month. Cost per 1,000 inferences: still $1.46 — if you run at perfect utilization.
But you do not run at perfect utilization. In reality:
- You need redundancy. If one GPU fails, your service goes down. You add 20% extra capacity. Now it is 12 GPUs.
- You need burst capacity for peak times. You add 30% headroom for spikes. Now it is 15 GPUs.
- You need monitoring, logging, and on-call support. That is $5,000-$10,000/month in labor and tooling.
- You need to fine-tune the model for your specific domain, which requires another $2,000-$5,000/month in compute and labor.
Real cost for 10M inferences per month: roughly $29,000-$37,000/month, or about $2.90-$3.70 per 1,000 inferences.
Suddenly, Anthropic Claude's $3 per million input tokens does not look so expensive.
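Putting the scenario's assumptions in one place makes the gap visible. The multipliers and fixed costs below are the ones from the list above (ops and fine-tuning taken at the midpoints of their ranges); rounding GPUs up to whole units lands at 16 rather than the 15 quoted:

```python
import math

# Capacity-planning sketch for the self-hosted scenario. All inputs are
# the assumptions from the text above, not measured benchmarks.
def self_hosted_monthly(inferences,
                        per_gpu_per_month=1_000_000,  # ~1M inferences/GPU
                        gpu_monthly=1_460,            # $2/hr * 730 hrs
                        redundancy=0.20,              # failure headroom
                        burst=0.30,                   # peak-traffic headroom
                        ops=7_500,                    # monitoring + on-call
                        fine_tuning=3_500):           # domain tuning
    base = inferences / per_gpu_per_month
    gpus = math.ceil(base * (1 + redundancy) * (1 + burst))
    total = gpus * gpu_monthly + ops + fine_tuning
    return gpus, total, 1_000 * total / inferences

gpus, total, per_1k = self_hosted_monthly(10_000_000)
print(f"{gpus} GPUs, ${total:,.0f}/mo, ${per_1k:.2f} per 1,000 inferences")
# 16 GPUs, $34,360/mo, $3.44 per 1,000 inferences
```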
The Nonlinearity: Where It Comes From
Token cost scales linearly. Infrastructure cost does not. Here is why:
- Redundancy and headroom are overhead you carry regardless of load. Adding 20% to your cluster to handle failures is capacity you pay for whether you use it or not. Whether you serve 100 queries a month or 1 million, you still need that redundancy.
- Observability and monitoring tools have tier breaks. A monitoring tool might cost $500/month up to 100M events, then $2,000/month for 100M-1B events, then $5,000/month beyond 1B. You hit these tier breaks as you scale, and the cost jumps in steps (see the sketch after this list).
- Human review and compliance scale with volume, not with tokens. If you are running AI in regulated industries (healthcare, finance, insurance), humans must review edge cases and uncertain outputs. As you scale from 1,000 to 10,000 inferences per day, the review rate might hold at 1%, but the absolute number of reviews grows with every inference — and human time is far more expensive than tokens. At scale, human review becomes your biggest line item.
- Data and model management become specialized. At small scale, one engineer manages your fine-tuned model. At 10M inferences/month, you need a dedicated MLOps team ($200k-$400k/year), a data pipeline ($50k-$100k/year), and model versioning and governance tools ($20k-$50k/year).
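Here is the tier-break behavior as a step function. The tiers are the illustrative ones from the observability bullet above, not a real vendor's price list:

```python
# Observability pricing as a step function of monthly event volume.
TIERS = [            # (events up to, monthly price)
    (100_000_000,     500),
    (1_000_000_000, 2_000),
    (float("inf"),  5_000),
]

def observability_cost(events_per_month):
    for limit, price in TIERS:
        if events_per_month <= limit:
            return price

for events in (50_000_000, 150_000_000, 2_000_000_000):
    print(f"{events:,} events/mo -> ${observability_cost(events):,}/mo")
# 50,000,000 events/mo -> $500/mo
# 150,000,000 events/mo -> $2,000/mo
# 2,000,000,000 events/mo -> $5,000/mo
```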
A Worked Example: The Nonlinear Bill
A health insurance company deploying AI claims adjudication:
Month 1: 10,000 claims (pilot)
- Token cost (Claude API): $50
- Vector database and observability: $500 (minimum tier)
- Human review of 100 edge cases: $5,000
- Total: $5,550. Cost per claim: $0.555
Month 6: 100,000 claims (expanding)
- Token cost: $500
- Infrastructure and observability: $2,000
- Human review of 500 edge cases: $25,000
- Total: $27,500. Cost per claim: $0.275
Month 12: 500,000 claims (production scale)
- Token cost: $2,500
- Infrastructure and observability (plus tooling): $10,000
- Human review of 2,000 edge cases: $100,000
- MLOps and data management: $15,000
- Total: $127,500. Cost per claim: $0.255
Token cost stayed at 1-2% of the total throughout. Human review fell from 90% to 78% as a share but grew 20x in absolute dollars, and specialized infrastructure and tooling grew from 9% to 20%. The cost per claim improves with scale, but the total bill grows far faster than a token-only forecast would predict.
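A quick recomputation of the table makes the point concrete — the figures are the example's own, and the comparison to a token-only forecast is what matters:

```python
# Figures from the worked example above, in dollars.
months = [
    ("Month 1",  10_000,  {"tokens": 50,    "infra": 500,    "review": 5_000,   "mlops": 0}),
    ("Month 6",  100_000, {"tokens": 500,   "infra": 2_000,  "review": 25_000,  "mlops": 0}),
    ("Month 12", 500_000, {"tokens": 2_500, "infra": 10_000, "review": 100_000, "mlops": 15_000}),
]

for name, claims, costs in months:
    total = sum(costs.values())
    print(f"{name}: ${total:,} total, ${total / claims:.3f}/claim, "
          f"tokens {costs['tokens'] / total:.1%}, review {costs['review'] / total:.1%}")
# Month 1:  $5,550 total,   $0.555/claim, tokens 0.9%, review 90.1%
# Month 6:  $27,500 total,  $0.275/claim, tokens 1.8%, review 90.9%
# Month 12: $127,500 total, $0.255/claim, tokens 2.0%, review 78.4%

# A token-only forecast scaled from Month 1 would predict 500,000 claims
# at $0.005/claim in tokens: $2,500/month -- not $127,500.
```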
What CFOs Should Do
- Do not forecast based on tokens alone. When your AI team says "we will scale to 10M inferences at $3 per million tokens," ask "and what about observability, retries, the vector database, and human review?"
- Build a cost budget by category. Not just "AI budget $100k/month," but "Tokens $30k, Observability $5k, Vector database $10k, Human review $40k, Infrastructure $15k." As volume scales, forecast how each line item will grow.
- Plan for nonlinear growth. If you forecast 10M inferences per month next year, do not assume the cost per inference stays the same. Assume it falls initially (better utilization), then rises again (compliance, redundancy, specialization).
- Measure the hidden costs. Instrument retry rates. Measure vector database queries per request. Track the human review percentage. These are where the nonlinearity lives.
Token cost is simple and scales predictably. Everything else about inference at scale does not. Build your financial model around that reality.
Go deeper with the field guide.
A step-by-step PDF for implementing AI cost attribution.