A pricing and latency optimization where a provider reuses computation for repeated prompt prefixes, often cutting input costs by 50–90%.
When consecutive requests share a long common prefix — a system prompt, a document, a codebase — prompt caching lets the provider reuse the computation for that prefix instead of reprocessing it. Cached input tokens are billed at a steep discount (often 10–50% of the normal input price) and prefill latency drops too.
Caching transforms the economics of agents and RAG systems, which resend large, mostly-identical contexts on every step. An agent loop that costs $1.00 per run without caching can cost $0.15 with it.
Implementations differ: some providers cache implicitly and automatically, others require explicit cache-control markers, and cache lifetimes range from minutes to hours. Check the per-provider details on any model page.