Model tiers (small, mid, frontier)

Most labs ship a family at three rough tiers — a small fast one, a balanced one, and a flagship — so you can match capability to the job and the budget.

Nearly every lab ships a tiered family rather than one model. Anthropic has Haiku / Sonnet / Opus; OpenAI has nano / mini / full; Google has Flash / Pro. The pattern is the same: a small, fast, cheap tier for high volume; a flagship for the hardest reasoning; and a balanced middle that's the right default for most production work.

The small tier is the workhorse most teams underuse. For chat, classification, extraction and routing, a Haiku- or Flash-class model is often 5–20× cheaper and several times faster than the flagship, at quality the task can't actually distinguish. Reach for the flagship only where reasoning genuinely gates the outcome.

A common production pattern is to route by difficulty — cheap small model first, escalate to the flagship only when confidence is low — which captures most of the quality at a fraction of the cost.

Small & fast LLMs →LLM speed vs cost →

More terms

Context window →Tokens per second (throughput) →Time to first token (latency) →Artificial Analysis Intelligence Index →GPQA (Diamond) →Elo rating (for LLMs) →