At 10,000 API calls per day, model selection is a financial decision. Routing every task to GPT-4o when GPT-4o mini works is the equivalent of using a crane to move a box.
As of mid-2025, frontier models cost 10–100× more per token than their smaller counterparts. At the same call volume and token count, switching from GPT-4o to GPT-4o mini reduces costs by ~94%.
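The ~94% figure falls directly out of per-token price arithmetic. A minimal sketch, using mid-2025 list prices as assumptions (verify against the current pricing pages before relying on these numbers):

```python
# Assumed mid-2025 list prices, USD per 1M tokens -- illustrative, not canonical.
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def daily_cost(model: str, calls: int, in_tokens: int, out_tokens: int) -> float:
    """USD per day for `calls` requests with fixed token counts per call."""
    p = PRICES[model]
    per_call = in_tokens / 1e6 * p["input"] + out_tokens / 1e6 * p["output"]
    return calls * per_call

# 10,000 calls/day, 1,000 input + 300 output tokens each.
big = daily_cost("gpt-4o", 10_000, 1_000, 300)
small = daily_cost("gpt-4o-mini", 10_000, 1_000, 300)
savings = 1 - small / big
print(f"GPT-4o: ${big:.2f}/day, mini: ${small:.2f}/day, savings {savings:.0%}")
```

At these prices the savings comes out to exactly 94%, because both the input and output rates of the smaller model are 6% of the larger model's.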
[Table: model tiers vs. best-fit workloads — reasoning and complex generation; classification and simple extraction; long-form, analysis, and coding; simple high-volume tasks; lowest cost with broad utility]
The highest-leverage optimization is routing tasks to the minimum viable model for that task. Build a routing layer that classifies each request before calling the LLM.
Small models: classification, routing, intent detection, simple extraction, chat responses
Mid-tier models: summarization, structured data generation, moderate reasoning
Frontier models: complex reasoning, creative generation, multi-step code, vision
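A routing layer like this can be sketched with a cheap pre-classifier in front of the model call. The task labels, keyword rules, and model names below are illustrative assumptions; in production the classifier is often a small LLM call or a trained model rather than keyword matching:

```python
# Hypothetical tier map -- model names are placeholders, not vendor IDs.
ROUTES = {
    "classify":  "small-model",     # classification, intent detection, extraction
    "summarize": "mid-model",       # summarization, structured generation
    "reason":    "frontier-model",  # complex reasoning, multi-step code, vision
}

def classify_task(request: str) -> str:
    """Cheap heuristic pre-classifier; swap in a small LLM or trained
    classifier for real traffic."""
    text = request.lower()
    if any(k in text for k in ("which category", "label", "extract the")):
        return "classify"
    if any(k in text for k in ("summarize", "tl;dr", "as json")):
        return "summarize"
    return "reason"  # default to the strongest tier when unsure

def route(request: str) -> str:
    return ROUTES[classify_task(request)]

print(route("Summarize this support ticket"))  # → mid-model
```

Defaulting unknown requests to the strongest tier trades a little cost for safety; misrouting a hard task to a small model fails visibly, while misrouting an easy task to a frontier model only wastes money.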
Both Anthropic and OpenAI offer prompt caching — when the same prefix (system prompt, retrieved context) is sent repeatedly, cached tokens cost 50–90% less. For applications with long static system prompts and variable user inputs, caching can cut input costs by 60–80%.
Structure prompts so the static content (instructions, context, examples) comes first. Variable user content comes at the end. This maximizes cache hit rate.
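The static-prefix-first rule can be sketched as a message builder. The message shape follows the common chat-completions format; exact caching mechanics (cache markers, minimum prefix lengths) differ by provider, so treat this as a structural sketch only:

```python
# Long, unchanging context: instructions, policies, few-shot examples,
# retrieved reference docs. Identical on every call -> cacheable prefix.
LONG_CONTEXT = "...policies, examples, retrieved reference docs..."

def build_messages(user_input: str) -> list[dict]:
    """Keep the static prefix byte-identical across calls so the provider's
    prompt cache can hit; only the final user turn varies."""
    return [
        {"role": "system", "content": "You are a support assistant. " + LONG_CONTEXT},
        {"role": "user", "content": user_input},  # variable suffix, last
    ]
```

Anything that varies per request (timestamps, user IDs) placed inside the system prompt breaks the byte-identical prefix and defeats the cache.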
Compress system prompts
Rewrite verbose instructions into dense, specific directives. 800-token prompts often compress to 300 without quality loss.
Limit output length
Set max_tokens to the realistic maximum for the task. An extraction task rarely needs more than 200 tokens of output.
Few-shot → zero-shot
Capable models often match few-shot performance with zero examples, and examples add input tokens to every call. Test zero-shot before paying for 3–5 shots.
Batch where latency allows
OpenAI's Batch API runs non-realtime workloads at 50% of standard cost. For async jobs, always batch.
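For async pipelines, the batching step reduces to preparing a JSONL input file, one request per line. The field names below follow OpenAI's documented batch request format, but verify against the current API reference before use; the model name and `max_tokens` cap are assumptions:

```python
import json

def to_batch_lines(texts: list[str]) -> list[str]:
    """One JSON request object per line, as the Batch API expects."""
    return [
        json.dumps({
            "custom_id": f"task-{i}",           # your handle for matching results
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",          # assumed small-tier model
                "messages": [{"role": "user", "content": t}],
                "max_tokens": 200,               # cap output for extraction-style tasks
            },
        })
        for i, t in enumerate(texts)
    ]

lines = to_batch_lines(["classify: refund request", "classify: login issue"])
# Write "\n".join(lines) to a file, then upload it and create the batch job.
```

Note the sketch also applies two of the earlier levers at once: the cheapest viable model and a tight `max_tokens` cap.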
Applied in combination, these techniques regularly deliver 80–95% cost reduction from an unoptimized baseline on a mature AI product.