Time to first token (latency)

How long an LLM takes to begin responding after receiving a request — the metric that determines how responsive a model feels.

Time to first token (TTFT) is the delay between sending a prompt and receiving the first piece of the response. It covers network time, queueing, and "prefill" — the model processing your entire input before it can generate anything.
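
As a rough illustration of the definition above, the sketch below times how long a stream of text chunks takes to yield its first non-empty chunk. The names `measure_ttft` and `fake_stream` are hypothetical, and `fake_stream` only simulates a streaming response with sleeps; it is not a real client.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, str]:
    """Return (ttft_seconds, total_seconds, text) for a stream of text chunks."""
    start = time.perf_counter()  # start the clock before consuming the stream
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None and chunk:
            # First non-empty chunk: this elapsed time is the TTFT.
            ttft = time.perf_counter() - start
        chunks.append(chunk)
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no non-empty chunks")
    return ttft, total, "".join(chunks)


def fake_stream(first_delay: float = 0.05, gap: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response (assumption, not a real API)."""
    time.sleep(first_delay)  # stands in for network time + queueing + prefill
    for piece in ["Hello", ", ", "world"]:
        yield piece
        time.sleep(gap)  # stands in for per-chunk generation time


if __name__ == "__main__":
    ttft, total, text = measure_ttft(fake_stream())
    print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms, text: {text!r}")
```

With a real streaming client, the timer would need to start before the request is sent so that network and queueing time are included in the measurement.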

Users perceive responses under ~300 ms as instant and over ~1 s as sluggish, so TTFT is the key metric for chat interfaces, autocomplete, and any other interactive product. Long prompts increase TTFT because prefill time scales with input length; reasoning models can add seconds of hidden thinking before the first visible token.
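
A back-of-the-envelope model can make these effects concrete. The sketch below assumes prefill time grows linearly with prompt length and that hidden reasoning tokens are generated at the normal decode rate; both are simplifications, and every number and name here (`estimate_ttft`, the rates, the delays) is a made-up illustration rather than a measurement of any real model.

```python
def estimate_ttft(
    prompt_tokens: int,
    prefill_tokens_per_s: float,
    network_s: float = 0.0,
    queue_s: float = 0.0,
    reasoning_tokens: int = 0,
    decode_tokens_per_s: float = 100.0,
) -> float:
    """Estimate TTFT in seconds under a simple additive model (assumption)."""
    prefill_s = prompt_tokens / prefill_tokens_per_s  # linear-in-length assumption
    thinking_s = reasoning_tokens / decode_tokens_per_s  # hidden tokens before the first visible one
    return network_s + queue_s + prefill_s + thinking_s


if __name__ == "__main__":
    # Hypothetical rates: 5,000 tokens/s prefill, 100 tokens/s decode, 50 ms network.
    short = estimate_ttft(1_000, 5_000, network_s=0.05)
    long = estimate_ttft(20_000, 5_000, network_s=0.05)
    reasoning = estimate_ttft(1_000, 5_000, network_s=0.05, reasoning_tokens=500)
    print(f"short prompt:      {short:.2f} s")
    print(f"long prompt:       {long:.2f} s")
    print(f"with 500 thinking: {reasoning:.2f} s")
```

Under these assumed rates, the short prompt stays below the ~300 ms threshold, while the long prompt and the reasoning case both land well past the ~1 s mark.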

How much TTFT matters relative to throughput depends on the use case: interactive products should optimize TTFT first, while batch pipelines can ignore it entirely.
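
To see why the priority depends on the workload, total response time can be approximated as TTFT plus output length divided by throughput. The sketch below compares two hypothetical setups under that assumption; the function names and all figures are invented for illustration.

```python
def total_time(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Approximate end-to-end time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_s


def break_even_tokens(ttft_a: float, rate_a: float, ttft_b: float, rate_b: float) -> float:
    """Output length at which setups A and B take the same total time."""
    return (ttft_b - ttft_a) / (1.0 / rate_a - 1.0 / rate_b)


if __name__ == "__main__":
    # Hypothetical setups: A is quick to start but slow to stream; B is the reverse.
    a = {"ttft_s": 0.3, "tokens_per_s": 50.0}
    b = {"ttft_s": 2.0, "tokens_per_s": 200.0}
    for n in (50, 2_000):
        ta = total_time(a["ttft_s"], n, a["tokens_per_s"])
        tb = total_time(b["ttft_s"], n, b["tokens_per_s"])
        print(f"{n:>5} output tokens: A = {ta:.2f} s, B = {tb:.2f} s")
    n_star = break_even_tokens(a["ttft_s"], a["tokens_per_s"], b["ttft_s"], b["tokens_per_s"])
    print(f"break-even at about {n_star:.0f} output tokens")
```

With these assumed numbers, setup A finishes short chat-style replies sooner and starts responding much earlier, while setup B finishes long batch-style outputs far sooner despite its slower start.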
