The rate at which an LLM generates output, measured in tokens per second — the main determinant of how fast responses feel once they start.
Throughput measures how quickly a model produces output once it starts generating. At 30 tokens/sec a long answer streams in slowly; at 200+ tokens/sec it appears nearly instantly. A token is roughly ¾ of an English word.
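To make the difference concrete, here is a rough sketch that estimates how long an answer takes to stream at a given throughput. It assumes the ¾-word-per-token rule of thumb above, which varies by tokenizer and language; the function name `streaming_seconds` and the 600-word answer length are illustrative choices, not part of any library.

```python
# Rough rule of thumb from the text: one token is about 3/4 of an English word.
# This ratio is an approximation and differs across tokenizers and languages.
WORDS_PER_TOKEN = 0.75


def streaming_seconds(words: int, tokens_per_sec: float) -> float:
    """Estimate the seconds needed to stream `words` English words."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens = words / WORDS_PER_TOKEN  # convert words to an approximate token count
    return tokens / tokens_per_sec


# Compare the two throughput levels mentioned above for a hypothetical 600-word answer.
for rate in (30, 200):
    print(f"{rate} tokens/sec: about {streaming_seconds(600, rate):.1f} s")
```

Under these assumptions a 600-word answer is about 800 tokens, so it streams in roughly 27 seconds at 30 tokens/sec and about 4 seconds at 200 tokens/sec.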
Throughput depends on both the model (smaller and quantized models are generally faster) and the inference provider's hardware. The same open-weight model can run at 40 tokens/sec on one provider and 400 tokens/sec on a provider using specialized accelerators.
Throughput matters most for long outputs — code generation, long-form writing, agent loops that read their own output. For short answers, latency (time to first token) dominates perceived speed instead.
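The trade-off between latency and throughput can be sketched with a simple model: total response time is time to first token plus output tokens divided by throughput. The code below is a simplification that assumes a constant generation rate; the function names and the example figures (0.5 s to first token, 50 tokens/sec) are hypothetical, chosen only for illustration.

```python
def total_response_seconds(ttft_s: float, output_tokens: int, tokens_per_sec: float) -> float:
    """Total time = time to first token + time spent generating the output."""
    return ttft_s + output_tokens / tokens_per_sec


def generation_share(ttft_s: float, output_tokens: int, tokens_per_sec: float) -> float:
    """Fraction of the total response time that is governed by throughput."""
    generation_s = output_tokens / tokens_per_sec
    return generation_s / (ttft_s + generation_s)


# Hypothetical provider: 0.5 s time to first token, 50 tokens/sec throughput.
TTFT_S = 0.5
RATE = 50.0

for label, n_tokens in (("short answer", 20), ("long output", 2000)):
    total = total_response_seconds(TTFT_S, n_tokens, RATE)
    share = generation_share(TTFT_S, n_tokens, RATE)
    print(f"{label}: {total:.1f} s total, {share:.0%} spent generating")
```

With these example numbers, the 20-token answer spends more than half of its time waiting for the first token, while the 2,000-token output spends almost all of its time in generation, which is where throughput matters.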