Prompt caching

A pricing and latency optimization where a provider reuses computation for repeated prompt prefixes, often cutting input costs by 50–90%.

When consecutive requests share a long common prefix — a system prompt, a document, a codebase — prompt caching lets the provider reuse the computation for that prefix instead of reprocessing it. Cached input tokens are billed at a steep discount (often 10–50% of the normal input price) and prefill latency drops too.

Caching transforms the economics of agents and RAG systems, which resend large, mostly-identical contexts on every step. An agent loop that costs $1.00 per run without caching can cost $0.15 with it.

Implementations differ: some providers cache implicitly and automatically, others require explicit cache-control markers, and cache lifetimes range from minutes to hours. Check the per-provider details on any model page.

Cheapest LLMs →LLM pricing guide →

More terms

Context window →Tokens per second (throughput) →Time to first token (latency) →Artificial Analysis Intelligence Index →GPQA (Diamond) →Elo rating (for LLMs) →