LLM Model Selection for Production

Using a single model for every task is like using a single tool for every job. The wrong selection hurts in one of two ways: it overpays for quality you don't need, or it delivers quality too low for the task.

The Three Model Tiers

Frontier ($2.50–15.00 per 1M input tokens)

Models: GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus

Best for: complex reasoning, creative generation, multi-step code, vision, and ambiguous tasks where a quality failure is expensive

Mid-Range ($0.15–1.25 per 1M input tokens)

Models: GPT-4o mini, Claude 3 Haiku, Gemini 1.5 Pro

Best for: structured extraction, summarization, moderate-complexity Q&A, and classification with nuance

Budget / Speed ($0.07–0.59 per 1M input tokens)

Models: Gemini 1.5 Flash, Llama 3 (via Groq), Claude Haiku

Best for: high-volume classification, routing, simple chat, intent detection, and batch async jobs
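The tiers above can be captured as a small config table, which a routing layer or cost estimator can consume. A minimal sketch; the model identifiers and prices are the illustrative values from the tiers above, not a live price list:

```python
# Tier config sketch. Prices are USD per 1M input tokens, taken from the
# illustrative ranges above -- verify against current provider pricing.
TIERS = {
    "frontier": {
        "models": ["gpt-4o", "claude-3-5-sonnet", "claude-3-opus"],
        "input_price_per_m": (2.50, 15.00),
    },
    "mid": {
        "models": ["gpt-4o-mini", "claude-3-haiku", "gemini-1.5-pro"],
        "input_price_per_m": (0.15, 1.25),
    },
    "budget": {
        "models": ["gemini-1.5-flash", "llama-3-groq", "claude-haiku"],
        "input_price_per_m": (0.07, 0.59),
    },
}
```

Keeping this as data rather than scattering model names through the code makes it easy to swap models when pricing or benchmarks shift.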

Decision Framework: Four Questions

1. What is the cost of a quality failure?

If a bad output reaches a user and causes churn, support tickets, or wrong decisions — use frontier. If a bad output is easily caught or has low stakes — use budget tier.

2. What is the latency requirement?

Real-time chat needs sub-2s response. Budget and mid models are faster. Frontier models add 2–5s. Match model to acceptable latency.

3. What is the token volume?

At low volume, model cost difference is trivial. At 1M+ calls/day, the difference between tiers is hundreds of thousands of dollars annually.
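To make the volume math concrete, here is a back-of-envelope estimate. The workload numbers (1M calls/day, 1,000 input tokens per call) are assumptions, and the prices are the low ends of the frontier and mid-range tiers above:

```python
calls_per_day = 1_000_000   # assumed volume
tokens_per_call = 1_000     # assumed average input tokens
frontier_price = 2.50       # USD per 1M input tokens (low end of frontier tier)
mid_price = 0.15            # USD per 1M input tokens (low end of mid-range tier)

def annual_cost(price_per_m: float) -> float:
    daily_tokens = calls_per_day * tokens_per_call
    return daily_tokens / 1_000_000 * price_per_m * 365

savings = annual_cost(frontier_price) - annual_cost(mid_price)
print(f"${savings:,.0f}")  # prints $857,750
```

Even at the cheap end of the frontier tier, input tokens alone cost roughly $850k/year more than mid-range at this volume, before counting output tokens (which are typically priced several times higher).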

4. Is this a well-defined or ambiguous task?

Well-defined tasks with structured outputs (extract, classify, format) don't benefit from frontier reasoning. Ambiguous synthesis tasks do.

Building a Routing Layer

A routing layer classifies each incoming request before LLM selection. The classifier itself should use a fast, cheap model. The routing logic can be rules-based, ML-based, or a small LLM call.

# Pseudo-code routing example

if complexity == 'low' and task in ('classify', 'extract'):
    model = 'gpt-4o-mini'
elif task in ('summarize', 'moderate_qa'):
    model = 'claude-3-haiku'
else:
    model = 'claude-3-5-sonnet'
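Fleshed out as a runnable rules-based router. The keyword heuristics in `classify_task` are illustrative stand-ins for the cheap classifier; in production this would be an ML model or a small LLM call, as noted above:

```python
def classify_task(prompt: str) -> tuple[str, str]:
    """Rules-based stand-in for the cheap classifier: returns (task, complexity)."""
    text = prompt.lower()
    if any(kw in text for kw in ("classify", "which category", "label")):
        return "classify", "low"
    if any(kw in text for kw in ("extract", "pull out", "parse")):
        return "extract", "low"
    if any(kw in text for kw in ("summarize", "tl;dr")):
        return "summarize", "medium"
    return "open_ended", "high"   # default to the safe (expensive) path

def route(prompt: str) -> str:
    """Map a request to a model identifier using the tier logic above."""
    task, complexity = classify_task(prompt)
    if complexity == "low" and task in ("classify", "extract"):
        return "gpt-4o-mini"
    if task in ("summarize", "moderate_qa"):
        return "claude-3-haiku"
    return "claude-3-5-sonnet"
```

Note the failure mode is asymmetric by design: an unrecognized request falls through to the frontier model, so a classifier miss costs money rather than quality.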

Provider Selection Beyond Cost

Context window

Claude models offer 200k tokens; Gemini 1.5 Pro and Flash reach 1M. Critical for document processing.

Instruction following

GPT-4o and Claude 3.5 Sonnet lead on structured output adherence (JSON, XML).

Code generation

GPT-4o leads on HumanEval benchmarks. Claude competitive on real-world codebases.

Safety / refusals

Claude models more conservative. May refuse ambiguous requests. Consider for user-facing products.

Rate limits

OpenAI higher default throughput. Anthropic/Google need tier upgrades for production volume.
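Rate limits also argue for building a cross-provider fallback path from day one. A minimal sketch; `call` is a placeholder for whatever provider client you use, and `RateLimitError` stands in for that SDK's rate-limit exception:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the provider SDK's rate-limit exception."""

def complete_with_fallback(prompt: str, models: list[str], call,
                           retries: int = 2) -> str:
    """Try each model in order; retry with backoff on rate limits, then fail over.

    `call(model, prompt)` is a placeholder for a real provider client call.
    """
    for model in models:
        for attempt in range(retries):
            try:
                return call(model, prompt)
            except RateLimitError:
                if attempt < retries - 1:
                    time.sleep(2 ** attempt)  # brief backoff before retrying
    raise RuntimeError("all providers rate-limited")
```

Ordering the fallback list from cheapest acceptable model to most available one keeps routing decisions intact even when a primary provider throttles you.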