Retrieval-augmented generation (RAG)

A pattern that fetches relevant documents at query time and feeds them into the prompt, so the model answers from your data instead of memory.

RAG retrieves relevant chunks from a knowledge base, usually via vector search, and inserts them into the context before the model answers. It grounds responses in your own, current documents rather than the model's frozen training data, which reduces hallucination and sidesteps the knowledge cutoff.
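As a rough illustration of the retrieve-then-insert step, here is a minimal sketch. It assumes a toy bag-of-words similarity in place of a real embedding model and vector index, and the names `embed`, `retrieve`, `build_prompt`, and `KNOWLEDGE_BASE` are hypothetical, chosen only for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Insert the retrieved chunks into the context ahead of the question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical knowledge base, invented for this example.
KNOWLEDGE_BASE = [
    "Refunds are available within 30 days of purchase.",
    "Shipping to Canada takes 5 to 7 business days.",
    "Support is open Monday through Friday.",
]

if __name__ == "__main__":
    print(build_prompt("How long does shipping to Canada take?", KNOWLEDGE_BASE, k=1))
```

A production system would typically swap `embed` for a learned embedding model and the linear scan in `retrieve` for a vector index, but the shape of the flow stays the same: score, take the top few, and place them in the prompt.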

RAG systems resend large, mostly identical context on every query, so two model traits dominate both cost and latency: a context window big enough to hold the retrieved passages, and prompt caching to avoid paying full price again for the shared prefix. Together they decide whether a RAG product is cheap or ruinous at scale.
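The effect of caching on the input bill can be estimated with simple arithmetic. The sketch below is illustrative only: the function name, the token counts, the unit price, and the `cached_read_multiplier` discount are all assumptions rather than any provider's actual pricing, and it ignores cache-write surcharges and output tokens.

```python
from typing import Optional


def input_cost_per_query(
    prefix_tokens: int,
    fresh_tokens: int,
    price_per_mtok: float,
    cached_read_multiplier: Optional[float] = None,
) -> float:
    """Estimate the input cost of one query.

    prefix_tokens: tokens shared across queries (the cacheable prefix).
    fresh_tokens: tokens unique to this query (question, new passages).
    price_per_mtok: assumed price per million uncached input tokens.
    cached_read_multiplier: assumed fraction of the full price charged
        for cached tokens; None means no caching.
    """
    if cached_read_multiplier is None:
        prefix_rate = price_per_mtok
    else:
        prefix_rate = price_per_mtok * cached_read_multiplier
    total = prefix_tokens * prefix_rate + fresh_tokens * price_per_mtok
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 8,000 shared tokens and 2,000 fresh tokens per query.
    uncached = input_cost_per_query(8_000, 2_000, price_per_mtok=1.0)
    cached = input_cost_per_query(8_000, 2_000, price_per_mtok=1.0,
                                  cached_read_multiplier=0.1)
    print(f"uncached: {uncached:.4f}  cached: {cached:.4f}")
```

Under these assumed numbers the cached query costs roughly 28% of the uncached one; the saving grows with the share of the prompt that is identical across queries.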

RAG and long context are complements, not rivals: retrieval narrows millions of tokens down to the few thousand that matter, and a model with a large enough context window reasons over them.

