What is Multimodal AI and Why It Changes Cost Models

5 min read · Updated 2026-05-02

Runrate Framework

The AI Cost Iceberg

Visible API spend (10%) vs hidden inference, storage, observability, retries, human review (90%).

Read the full framework →

For most of 2023 and 2024, the AI narrative was text-only: ChatGPT takes text input, produces text output. But GPT-4V (GPT-4 Vision), Claude 3.5 Sonnet with vision, and Gemini 2.0 can now process images, and speech models like Whisper can ingest audio. This is multimodal AI: models that handle text, image, audio, and potentially video in a single inference. From a cost perspective, this changes everything.

Multimodal Means Higher Per-Token Costs (and More Tokens)

Text tokens cost roughly $0.003-$0.005 per 1,000. Image tokens are priced completely differently. OpenAI's vision pricing charges $0.01375 per 1,000 image tokens for low-detail images and $0.0275 per 1,000 for high-detail. Claude 3.5 Sonnet charges based on image size: $0.48 per 1MB of image data, which you have to estimate manually from your image dimensions. Gemini 2.0 charges $0.075 per 1M input tokens for text; images bill at a comparable per-token rate, but because images tokenize into so many tokens, the effective cost per unit of content lands at roughly 15x the text rate.

Here's the catch: images tokenize into far more tokens than you'd expect. A typical 1080p photograph costs roughly 2,000-3,000 tokens in GPT-4V's high-detail mode. A single page of a scanned document (healthcare claim, insurance form, mortgage application) can be 500-2,000 tokens depending on resolution. If you're processing thousands of images per day — medical imaging AI, document processing, quality control in manufacturing — the token cost multiplies.

Let's model a real example: a health insurance underwriting system that analyzes medical documents (X-rays, lab results, notes) along with the written claim. A typical claim might include 3 images (each roughly 1MB, or about 2,000 tokens at high detail) plus 1,000 tokens of claim text. Using Claude 3.5 Sonnet, the inference cost would be (3 × $0.48) + (1,000 × $0.003 / 1,000) = $1.44 + $0.003 ≈ $1.44 per claim. If you process 100 claims per day, that's $144/day just in image processing cost, or $4,320/month. Add the hidden costs (retries, human review, observability), and you're quickly at $8,000-$12,000/month for a workload that might previously have cost $6,000/month.
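The arithmetic above can be captured in a small cost model. This is a hedged sketch: the per-image and per-1K-token prices are the illustrative figures from this example (assuming ~1MB images), not authoritative vendor rates.

```python
# Illustrative cost model for the underwriting example above.
# Prices are this article's example assumptions, not official vendor rates.
COST_PER_IMAGE_USD = 0.48     # ~1MB image on Claude 3.5 Sonnet (assumed)
TEXT_COST_PER_1K_USD = 0.003  # input text tokens (assumed)

def claim_cost(images: int, text_tokens: int) -> float:
    """Inference cost of one claim: per-image charges plus text tokens."""
    return images * COST_PER_IMAGE_USD + (text_tokens / 1000) * TEXT_COST_PER_1K_USD

per_claim = claim_cost(images=3, text_tokens=1000)  # ~ $1.44
daily = per_claim * 100                             # 100 claims/day
monthly = daily * 30
print(f"per claim: ${per_claim:.2f}, daily: ${daily:.2f}, monthly: ${monthly:,.0f}")
```

Swapping in your own volumes makes it obvious that the image term dominates: the 1,000 tokens of claim text contribute well under 1% of the per-claim cost.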

Audio Adds Another Cost Dimension

Speech-to-text models like OpenAI's Whisper are opening up a new use case: customer calls, recorded meetings, phone support conversations transcribed to text and analyzed by an LLM. But audio is expensive. OpenAI charges $0.02 per minute of Whisper transcription. If you're processing 500 customer calls per month averaging 10 minutes each (5,000 minutes), that's $100/month in transcription alone. Scale to 10,000 calls/month and you're at $2,000/month — before you even analyze the transcripts with an LLM.
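The transcription math scales linearly with minutes. A one-line model, using the $0.02/minute rate quoted above (an assumption to verify against current pricing):

```python
# Monthly speech-to-text spend at a flat per-minute rate.
WHISPER_USD_PER_MIN = 0.02  # batch rate assumed from the text

def transcription_monthly_cost(calls_per_month: int, avg_minutes: float) -> float:
    """Total transcription cost before any LLM analysis of the transcripts."""
    return calls_per_month * avg_minutes * WHISPER_USD_PER_MIN

print(transcription_monthly_cost(500, 10))     # -> 100.0
print(transcription_monthly_cost(10_000, 10))  # -> 2000.0
```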

The hidden cost: audio models introduce latency. Real-time speech-to-text requires streaming, which is more expensive than batch transcription. Some vendors charge 2x for real-time audio vs. offline. If you're building a phone agent, the audio processing cost can exceed the LLM cost.

The AI Cost Iceberg Gets Deeper With Multimodal

This is where the AI Cost Iceberg principle becomes critical. Most finance teams budget image processing as a simple line item: "Computer vision $5K/month." But they're not budgeting for:

  • Retry cost: If an image fails to process (network error, model hallucination), the system retries. Retries multiply cost by 1.5-3x depending on your error rates.
  • Human review: Multimodal outputs (especially medical, legal, financial decisions) require human verification. A human radiologist reviewing an AI diagnosis takes 5 minutes. At $100/hour, that's $8.33 per image reviewed. If you process 100 images/day, that's $833/day in review cost — $18,000/month.
  • Quality evaluation: You need to test whether the model is correctly interpreting images. This requires labeled datasets, human scorers, and continuous evaluation. Budget $10K-$50K upfront.
  • Storage and retrieval: Images need to be stored somewhere (S3, cloud storage) and retrieved when needed. This adds egress costs and latency.

The visible cost (image tokens) might be 20% of the true cost. The hidden cost lives in human review, retries, and quality assurance.
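A hedged way to operationalize the iceberg: gross up visible token spend with the hidden categories listed above. The default retry multiplier comes from the 1.5-3x range in the text; the example inputs are assumptions to replace with your own telemetry.

```python
def true_monthly_cost(visible_usd: float,
                      retry_multiplier: float = 2.0,    # 1.5-3x range from the text
                      review_usd: float = 0.0,          # human review line item
                      eval_amortized_usd: float = 0.0,  # evaluation spend per month
                      storage_usd: float = 0.0) -> float:
    """Gross up visible API spend with the hidden iceberg categories."""
    return visible_usd * retry_multiplier + review_usd + eval_amortized_usd + storage_usd

# If visible image-token spend is ~20% of true cost, a $5K line item
# implies something closer to $25K once the hidden categories are filled in:
print(true_monthly_cost(5000, retry_multiplier=2.0,
                        review_usd=14000, eval_amortized_usd=800, storage_usd=200))
```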

Document Processing as a Case Study

Document processing is the canonical multimodal use case: your claims team gets PDFs, your mortgage team gets application packages, your legal team gets contracts. Instead of having a human read each one, you send it to a multimodal AI.

A typical mortgage application PDF is 10-20 pages. At roughly 500 tokens per page (the low end for scanned documents), that's 5,000-10,000 tokens per document. If you process 200 mortgage applications per day, that's 1-2 million tokens/day, or 30-60 million tokens/month. At GPT-4V high-detail pricing ($0.0275/1K), that's $825-$1,650/month in visible cost. Hidden cost: a human mortgage underwriter still needs to review the AI's extraction (did it correctly pull the income figures, employment history, debt obligations?). A 15-minute review at $50/hour is $12.50 per application. For 200 applications/day, that's $2,500/day, or $50,000/month in review cost.

The true economics: processing a mortgage application with AI costs roughly $0.20 (tokens) + $12.50 (human review) = $12.70. Without multimodal AI, a human processor costs $20/application (40 minutes at $30/hour). AI wins on cost. But only if you account for the human review time. If you budget only the token cost, you're flying blind.
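Putting the comparison in code makes the sensitivity obvious: review minutes, not tokens, dominate the per-application cost. This sketch assumes a mid-range per-application token count; `ai_cost_per_app` and its parameters are illustrative, not a standard API.

```python
# Per-application economics: AI extraction + human review vs. fully human processing.
GPT4V_HIGH_USD_PER_1K = 0.0275  # high-detail image-token rate from the text

def ai_cost_per_app(tokens_per_app: int, review_minutes: float,
                    reviewer_usd_per_hour: float) -> float:
    """Token cost plus the human verification step."""
    tokens = (tokens_per_app / 1000) * GPT4V_HIGH_USD_PER_1K
    review = (review_minutes / 60) * reviewer_usd_per_hour
    return tokens + review

def human_cost_per_app(minutes: float, usd_per_hour: float) -> float:
    """Cost of a human processing the application end to end."""
    return (minutes / 60) * usd_per_hour

ai = ai_cost_per_app(tokens_per_app=7500, review_minutes=15, reviewer_usd_per_hour=50)
human = human_cost_per_app(minutes=40, usd_per_hour=30)
print(f"AI + review: ${ai:.2f} vs human-only: ${human:.2f}")
```

Note that the token term is a rounding error next to the review term: doubling the document's token count barely moves the total, while shaving five minutes off review time changes it materially.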

Optimization Levers for Multimodal Systems

If you're deploying multimodal AI, here are the cost levers:

1. Image compression: Lower-resolution images reduce token count. A 1,920x1,080 image might be 3,000 tokens; downsampled to 960x540 it might be 500 tokens. Quality matters — but for many use cases, lower quality is fine.

2. Batch processing: If your images don't need real-time analysis, batch them overnight at cheaper rates (OpenAI batch API, Anthropic batch pricing when available).

3. Semantic caching: If the same image appears in multiple inferences (a standard form, a frequently-used document), cache it to avoid reprocessing.

4. Structured extraction: Instead of asking the model to analyze an image and write a long response, ask it to extract specific fields into a JSON structure. Structured output reduces token cost and improves quality.

5. Reduce human review: The biggest cost lever is improving model accuracy so humans don't need to review every output. This is evaluation cost upfront, but savings downstream.
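As an illustration of lever 3, a minimal exact-match cache keyed on image bytes: identical images (a standard form, a recurring document) skip the paid model call. `process_image` is a hypothetical stand-in for your vision-model request; true semantic caching (near-duplicate matching via embeddings) is more involved, but byte-identical caching alone catches repeated standard forms.

```python
# Minimal content-cache sketch for lever 3: skip reprocessing identical images.
import hashlib

_cache: dict[str, str] = {}

def analyze_image(image_bytes: bytes, process_image) -> str:
    """Return a cached result for identical image bytes; otherwise call the model."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _cache:
        _cache[key] = process_image(image_bytes)  # the paid model call
    return _cache[key]

calls = 0
def fake_model(b: bytes) -> str:  # stand-in for a real vision-model request
    global calls
    calls += 1
    return f"analysis of {len(b)} bytes"

analyze_image(b"standard-form", fake_model)
analyze_image(b"standard-form", fake_model)  # cache hit: no second model call
print(calls)  # -> 1
```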

The CFO principle: multimodal AI is not cheaper than text-only AI. It's more useful for certain workflows (document processing, medical imaging, quality control), but the per-item cost runs 10-50x that of text-only workloads. Budget accordingly, or face bill shock when you deploy at scale.

Curious where your team sits on the 5-Stage AI Cost Maturity Curve? Take the 15-question self-assessment and get a personalized report on your path to work-item-level cost attribution for multimodal AI workloads.
