Runrate Framework
Build or buy? That is the permanent question in tech. With AI, the question is more specific: should you use an API (rent a model), fine-tune your own, or self-host?
Each model has different cost, risk, and lock-in characteristics. The CFO needs to understand the trade-offs because the choice cascades through your entire budget and your ability to change vendors.
The Three Models At a Glance
| Factor | API (OpenAI, Anthropic, Google) | Fine-Tuned Model | Self-Hosted |
|--------|---------------------------------|------------------|-------------|
| Setup cost | $0 | $50k-$150k | $100k-$300k |
| Monthly COGS (1M inferences, excl. human review) | ~$48,000 | ~$25,000 | ~$148,000 |
| Latency | 1-5 sec | 1-5 sec | <1 sec |
| Data privacy | Vendor has access | Depends on contract | You own the data |
| Lock-in | High | Medium | Low |
| Customization | Low | Medium | High |
| Time to production | Days | Weeks | Months |
Model 1: API (Rent-a-Model)
You send your data to OpenAI, Anthropic, or Google Cloud. They run the inference. You pay by token.
Cost structure:
- Setup: $0 (you just get an API key)
- Per-inference cost: $0.003 (input) + $0.015 (output) per 1,000 tokens = ~$0.045 per customer support conversation
- Infrastructure: $0 (vendor handles it)
- Observability: $1,000-$5,000/month (Langfuse, Helicone, or custom)
- Human review: 5-10% of queries, at $5-$10 per review = $0.25-$0.50 per inference
- Total COGS per inference: ~$0.30-$0.60
At scale (1M inferences/month):
- Token cost: $45,000
- Observability: $3,000
- Human review: $250,000-$500,000 ($0.25-$0.50 per inference, per the assumptions above)
- Total: $300,000-$550,000/month
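These line items can be checked with a quick back-of-the-envelope script (an illustrative sketch in Python; every rate here is an assumption from this section, not a vendor price list):

```python
# Back-of-the-envelope monthly COGS for the API model at 1M inferences/month.
# All rates are this section's illustrative assumptions, not vendor pricing.
inferences = 1_000_000
token_cost_per_inference = 0.045   # ~$0.003/1K input + $0.015/1K output tokens
observability = 3_000              # monthly tooling (Langfuse, Helicone, or custom)
review_cost_per_review = 5         # $ per human review
review_rate_low, review_rate_high = 0.05, 0.10  # 5-10% of queries reviewed

token_cost = inferences * token_cost_per_inference                    # ~$45,000
review_low = inferences * review_rate_low * review_cost_per_review    # ~$250,000
review_high = inferences * review_rate_high * review_cost_per_review  # ~$500,000

total_low = token_cost + observability + review_low    # ~$298,000
total_high = token_cost + observability + review_high  # ~$548,000
print(f"API monthly COGS: ${total_low:,.0f}-${total_high:,.0f}")
```

Note what the script makes obvious: human review is 80-90% of the bill. The token cost everyone fixates on is the smallest line item.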
This looks expensive until you remember: you are paying a vendor to handle all of the infrastructure. The vendor owns the GPUs, the uptime, the model improvements. You do not have to hire an MLOps team.
Pros:
- Fastest to market. You can go live in days.
- No capital cost. It is all opex.
- No lock-in to infrastructure. You can switch vendors (though it hurts).
- Vendor handles model improvements. You get GPT-4.5 for free when they release it.
Cons:
- Vendor has access to your data (unless you negotiate a contract clause). Many companies cannot use OpenAI API for regulated data (healthcare, finance, legal).
- Per-token pricing means your costs scale linearly with your volume. There are no economies of scale: as your usage grows, the vendor's revenue grows right along with it.
- Unpredictable costs. If your prompts get longer or you need more retries, costs spike.
- Vendor control. OpenAI can raise prices (which they have done multiple times) and you have no recourse.
Best for: Startups, proof-of-concepts, low-data-sensitivity use cases (content generation, code summaries, general Q&A).
Model 2: Fine-Tuned Model
You take a base model (Llama 2, Mistral, or a smaller Claude), fine-tune it on your proprietary data, and deploy it on a vendor's infrastructure (Together.ai, Replicate, Lambda Labs).
Cost structure:
- Setup: $50,000-$150,000 (data preparation, fine-tuning training, validation)
- Per-inference cost: $0.002 per 1,000 tokens (cheaper than the API because the model is smaller and faster)
- Infrastructure: $1,000-$3,000/month (hosting the fine-tuned model on vendor infrastructure)
- Observability: $1,000-$5,000/month
- Human review: Still 5-10% of queries
- Total COGS per inference: ~$0.25-$0.50
At scale (1M inferences/month):
- Token cost: $20,000 (smaller model, cheaper pricing)
- Infrastructure: $2,000
- Observability: $3,000
- Human review: $250,000 (at $0.25 per inference, the low end of the band above)
- Total: $275,000/month
Notice: it is only slightly cheaper per inference than the API model, and only if your fine-tuning actually works. If your fine-tuned model is worse than the base model, you have paid six figures for a degraded experience.
Pros:
- Slightly cheaper per inference than API.
- Your data stays on your infrastructure (if you self-host the fine-tuning). Easier compliance.
- You can have a smaller, cheaper model. You do not need GPT-4 if Llama 2 fine-tuned on your data works as well.
- More control. You own the trained model (usually).
Cons:
- Higher upfront cost. You have to invest in data, training, and validation before you see ROI.
- Ongoing maintenance cost. The base model improves; your fine-tuned version gets stale. You need to re-fine-tune periodically.
- Still reliant on vendor infrastructure. You are paying Replicate or Together.ai for hosting and you still have vendor lock-in.
- Difficult to evaluate. "Is our fine-tuned Llama 2 as good as Claude?" requires careful A/B testing, which costs time and money.
Best for: Mid-market companies with domain-specific use cases (claims processing, legal document review, vertical-specific customer support) where fine-tuning can give you a meaningful edge.
Model 3: Self-Hosted
You run the model on your own GPUs (on AWS, Azure, GCP, or on-premises).
Cost structure:
- Setup: $100,000-$300,000 (GPU infrastructure, MLOps tooling, team hiring)
- Per-inference cost: $0.001 per 1,000 tokens (very cheap; you own the hardware)
- Infrastructure: $20,000-$50,000/month (GPUs, networking, storage, redundancy)
- Operations: $50,000-$150,000/month (MLOps team, monitoring, security)
- Observability: $5,000-$10,000/month
- Human review: 5-10% of queries
- Total COGS per inference: ~$0.15-$0.40
At scale (1M inferences/month):
- Token cost: $1,000 (almost free; you own the GPU)
- Infrastructure: $40,000
- Operations team: $100,000
- Observability: $7,000
- Human review: $250,000-$500,000
- Total: $400,000-$650,000/month
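The three at-scale estimates can be put side by side (a sketch using this section's illustrative figures; `infra` bundles hosting, ops team, and observability, and the human-review band carries over largely unchanged because it does not depend on where the model runs):

```python
# Monthly cost at 1M inferences/month, using this article's illustrative figures.
# Tuple: (token_cost, infra_and_ops, review_low, review_high). The text quotes
# only the low end of the review band for the fine-tuned model.
models = {
    "api":         (45_000,   3_000, 250_000, 500_000),
    "fine_tuned":  (20_000,   5_000, 250_000, 250_000),
    "self_hosted": ( 1_000, 147_000, 250_000, 500_000),
}
for name, (tokens, infra, rev_lo, rev_hi) in models.items():
    lo, hi = tokens + infra + rev_lo, tokens + infra + rev_hi
    print(f"{name:>11}: ${lo:,}-${hi:,}/month")
```

Token cost is a rounding error for the self-hosted model; the operations overhead is the real bill.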
This is deceptive. Self-hosted looks cheaper on token cost, but operational overhead is huge. You now have a team responsible for uptime, security, model versioning, and compliance.
Pros:
- Total data privacy. Your data never leaves your infrastructure.
- No vendor lock-in. You run open-source models (Llama, Mistral, Qwen) and can switch easily.
- Latency is excellent. Sub-second inference is possible.
- No per-token vendor pricing. If you have lots of traffic, costs flatten out (you paid for the GPU, it is sunk cost).
Cons:
- High upfront cost, high ongoing cost. You are paying for infrastructure and team whether you use it or not.
- Operational burden. You now own uptime, security, monitoring, and compliance.
- Model lag. You are using Llama 2; Llama 3 and 4 have been released and are better. You have to evaluate, migrate, and re-integrate.
- Difficult to achieve cost efficiency at small scale. You need a lot of traffic to amortize the $40k/month infrastructure cost.
Best for: Enterprise companies with massive scale (10M+ inferences/month), strict data privacy requirements (regulated industries), or long-term commitment to AI as a core capability.
The Break-Even Analysis
When should you move from API to fine-tuned, and from fine-tuned to self-hosted?
API to fine-tuned break-even: divide the upfront fine-tuning cost (~$100k) by the monthly savings from the cheaper per-inference cost; that quotient is your payback period in months. Move when the payback fits inside your planning horizon.
If the API costs $0.30 per query and the fine-tuned model costs $0.25 per query, you save $0.05 per query. At 1M queries/month, that is $50,000/month, or $600,000/year. The $100k fine-tuning cost pays for itself in two months.
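The same payback arithmetic, as a few lines you can rerun with your own volumes (the $0.30 and $0.25 per-query rates are this section's illustration, not quotes):

```python
# Payback period for fine-tuning: upfront cost divided by monthly savings.
# All figures are the illustrative numbers from the text, not real quotes.
upfront_fine_tune = 100_000        # one-time data prep, training, validation
api_cost_per_query = 0.30
fine_tuned_cost_per_query = 0.25
queries_per_month = 1_000_000

monthly_savings = (api_cost_per_query - fine_tuned_cost_per_query) * queries_per_month
payback_months = upfront_fine_tune / monthly_savings
print(f"Savings: ${monthly_savings:,.0f}/month; payback in {payback_months:.0f} months")
```

At 100k queries/month the identical math stretches the payback to 20 months, which is why most deployments never justify fine-tuning on cost alone.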
But most companies never reach this volume. Most AI deployments are at 10k-100k inferences per month, where API is cheaper.
Fine-tuned to self-hosted break-even: when the upfront cost of self-hosting (~$300k) plus the ongoing operational cost (~$1.5M/year) is less than what you would otherwise pay the vendor over the same period.
If fine-tuned vendor hosting costs $3,000/month ($36k/year) and self-hosting costs $300k upfront plus $1.5M per year to run, break-even is at least five years away, and on those numbers it may never arrive on cost alone. That also assumes you do not need a team larger than the one you budgeted.
The Hidden Flip: Data Privacy and Compliance
The calculation changes dramatically if you have data privacy or compliance requirements. If you cannot send customer data to OpenAI (because you are in healthcare, finance, or legal services), the API option is off the table.
In that case, the choice is fine-tuned on a private vendor infrastructure vs. self-hosted. The cost difference shrinks, and self-hosting becomes more attractive even at smaller scale.
What To Do Next
Ask your AI team: How many inferences per month do we run today? What is our growth trajectory? What are the data privacy requirements?
- If you are under 100k inferences/month and have no data privacy constraints, use the API. It is the fastest, cheapest, lowest-risk option.
- If you are at 100k-1M inferences/month and have some privacy concerns, evaluate fine-tuned models. Run a 3-month pilot to validate the quality vs. API, then make the decision.
- If you are at 1M+ inferences/month or have strict privacy requirements, start scoping self-hosted. It is a 3-6 month project with significant operational commitment.
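The three rules above can be collapsed into a small decision function (a sketch; the thresholds are the cut-offs stated here, and the `strict_privacy` flag is a stand-in for whatever your compliance review actually concludes):

```python
def hosting_recommendation(inferences_per_month: int, strict_privacy: bool) -> str:
    """Map this article's rules of thumb to a recommendation.

    The 100k and 1M thresholds are the illustrative cut-offs from the text,
    not universal constants; recalibrate them against your own cost model.
    """
    if inferences_per_month >= 1_000_000 or strict_privacy:
        return "scope_self_hosted"   # a 3-6 month project with real ops commitment
    if inferences_per_month >= 100_000:
        return "pilot_fine_tuned"    # run a 3-month quality pilot vs. the API
    return "api"                     # fastest, cheapest, lowest-risk option

print(hosting_recommendation(50_000, strict_privacy=False))
```

The function encodes the article's ordering deliberately: privacy constraints override volume, because they take options off the table rather than merely repricing them.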
The decision tree is not "which is cheapest" but "which lets us scale fastest while respecting our constraints."