
LLM Cost Optimization at Scale

At 10,000 API calls per day, model selection is a financial decision. Routing every task to GPT-4o when GPT-4o mini works is the equivalent of using a crane to move a box.

The Cost Spread Between Models

As of mid-2025, frontier models cost 10–100× more per token than their smaller counterparts. At the same call volume and token count, switching from GPT-4o to GPT-4o mini reduces costs by ~94%.
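The ~94% figure follows directly from the list prices. A back-of-envelope sketch, where the 10,000 calls/day comes from above but the per-call token counts (1,000 in, 300 out) are illustrative assumptions:

```python
# Cost comparison at identical volume, using mid-2025 list prices.
# Per-call token counts are illustrative assumptions, not measured values.
PRICES_PER_M = {  # USD per million tokens: (input, output)
    "gpt-4o":      (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def daily_cost(model: str, calls: int = 10_000,
               in_tok: int = 1_000, out_tok: int = 300) -> float:
    """USD per day for `calls` requests of in_tok/out_tok tokens each."""
    p_in, p_out = PRICES_PER_M[model]
    return calls * (in_tok * p_in + out_tok * p_out) / 1_000_000

frontier = daily_cost("gpt-4o")        # $55.00/day
mini = daily_cost("gpt-4o-mini")       # ≈ $3.30/day
savings = 1 - mini / frontier          # ≈ 0.94, the ~94% reduction
```

The ratio holds at any volume because both input and output prices drop by the same factor of ~16.7×.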

GPT-4o: $2.50/M in · $10.00/M out
Reasoning, complex generation

GPT-4o mini: $0.15/M in · $0.60/M out
Classification, simple extraction

Claude 3.5 Sonnet: $3.00/M in · $15.00/M out
Long-form, analysis, coding

Claude 3 Haiku: $0.25/M in · $1.25/M out
Simple tasks, high volume

Gemini 1.5 Flash: $0.075/M in · $0.30/M out
Lowest cost, broad utility

Strategy 1: Task-Based Model Routing

The highest-leverage optimization is routing tasks to the minimum viable model for that task. Build a routing layer that classifies each request before calling the LLM.

Mini tier: 90–94% cost savings vs. frontier
Classification, routing, intent detection, simple extraction, chat responses

Mid tier: 70–85% cost savings vs. frontier
Summarization, structured data generation, moderate reasoning

Frontier tier: baseline
Complex reasoning, creative generation, multi-step code, vision
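A routing layer along these lines can be sketched in a few lines. The keyword heuristic and the model assigned to each tier are illustrative assumptions; a production router would typically use rules tuned to its own traffic or a cheap classifier model:

```python
# Illustrative task→tier router. The tier→model mapping and the keyword
# heuristic are assumptions for this sketch, not a production recipe.
TIER_MODEL = {
    "mini":     "gpt-4o-mini",              # classification, extraction
    "mid":      "claude-3-haiku-20240307",  # summarization, structured data
    "frontier": "gpt-4o",                   # complex reasoning, code, vision
}

def classify_task(request: str) -> str:
    """Toy heuristic; real routers use rules or a small classifier model."""
    text = request.lower()
    if any(k in text for k in ("classify", "label", "intent", "extract")):
        return "mini"
    if any(k in text for k in ("summarize", "summary", "to json")):
        return "mid"
    return "frontier"  # when unsure, fail safe to the capable tier

def route(request: str) -> str:
    return TIER_MODEL[classify_task(request)]
```

Defaulting unknown tasks to the frontier tier trades some cost for safety; the savings come from the high-volume, easily recognized tasks that fall through to the cheap tiers.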

Strategy 2: Prompt Caching

Both Anthropic and OpenAI offer prompt caching — when the same prefix (system prompt, retrieved context) is sent repeatedly, cached tokens cost 50–90% less. For applications with long static system prompts and variable user inputs, caching can cut input costs by 60–80%.

Structure prompts so the static content (instructions, context, examples) comes first. Variable user content comes at the end. This maximizes cache hit rate.
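That ordering can be made concrete by building the request body with the static block marked cacheable. The sketch below follows the `cache_control` field from Anthropic's Messages API prompt caching; the instruction text and model choice are placeholders, and actually sending the request is omitted:

```python
# Build a cache-friendly Messages API request body: static, cacheable
# content first; variable user input last. The cache_control field follows
# Anthropic's prompt-caching docs; sending the request is omitted here.
def build_request(static_instructions: str, user_input: str) -> dict:
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": static_instructions,             # byte-identical each call
            "cache_control": {"type": "ephemeral"},  # cache up to this point
        }],
        # Only this part varies between calls, so the prefix stays cacheable.
        "messages": [{"role": "user", "content": user_input}],
    }
```

Any change to the static prefix invalidates the cache, so keep instructions, retrieved context, and examples stable across calls.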

Strategy 3: Token Reduction

Compress system prompts

Rewrite verbose instructions into dense, specific directives. 800-token prompts often compress to 300 without quality loss.

Limit output length

Set max_tokens to the realistic maximum for the task. An extraction task rarely needs more than 200 tokens of output.

Few-shot → zero-shot

Good models often perform as well with no examples as with 3–5 examples. Test both before adding shots.

Batch where latency allows

OpenAI's Batch API charges 50% of standard prices for non-realtime workloads. For async jobs, always batch.
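A batch job is expressed as a JSONL file, one request per line. A minimal sketch of preparing that file, where the prompt, file name, and `custom_id` scheme are illustrative and the upload/submission steps are omitted:

```python
import json

# Prepare an OpenAI Batch API input file: one JSON request per line.
# The line shape (custom_id / method / url / body) follows OpenAI's batch
# docs; the 200-token output cap echoes the extraction example above.
def batch_lines(prompts, model="gpt-4o-mini", max_tokens=200):
    for i, prompt in enumerate(prompts):
        yield json.dumps({
            "custom_id": f"task-{i}",      # your key for matching results
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "max_tokens": max_tokens,  # cap output for extraction tasks
                "messages": [{"role": "user", "content": prompt}],
            },
        })

with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(batch_lines(["Extract the invoice total."])))
```

Note that batching composes with routing: the batch body can name the mini-tier model, stacking the 50% batch discount on top of the routing savings.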

Combined Savings Potential

Task routing to mini tier: −85–94%
Prompt caching (long static prompts): −40–60% on input
Token compression: −20–40%
Batch API for async jobs: −50%

Applied in combination on a mature AI product, these strategies regularly deliver an 80–95% cost reduction from an unoptimized baseline.
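As a sanity check on that range: the reductions compound multiplicatively on the portions of spend they touch. The rates and the 50/50 input/output spend split below are illustrative assumptions drawn from within the ranges above:

```python
# Illustrative compounding of the savings above. Routing and compression
# apply to all spend; caching only to the input share. All rates and the
# 50/50 input/output spend split are assumptions within the quoted ranges.
def remaining_spend(routing=0.90, compression=0.30,
                    caching=0.50, input_share=0.5):
    """Fraction of the unoptimized baseline still being paid."""
    rem = (1 - routing) * (1 - compression)            # applies to everything
    rem *= input_share * (1 - caching) + (1 - input_share)  # input side only
    return rem

rem = remaining_spend()   # ≈ 0.0525, i.e. roughly a 95% total reduction
```

Because the factors multiply, even conservative picks inside each range land comfortably inside the 80–95% band.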