modelgrep

Tokens per second (throughput)

The rate at which an LLM generates output, measured in tokens per second — the main determinant of how fast responses feel once they start.

Throughput measures how quickly a model produces output once it starts generating. At 30 tokens/sec a long answer visibly streams in over many seconds; at 200+ tokens/sec it appears nearly instantly. A token is roughly ¾ of an English word, so a 750-word answer is about 1,000 tokens.
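As a rough illustration of those numbers, the sketch below converts a word count to tokens using the ¾-word rule of thumb and estimates how long generation takes at a given rate. The function names and the 750-word answer are hypothetical, chosen only for this example; real token counts depend on the tokenizer and the language.

```python
# Rule of thumb from the text: one token is roughly 3/4 of an English word.
# This is an approximation; actual ratios vary by tokenizer and language.
WORDS_PER_TOKEN = 0.75


def words_to_tokens(words: int) -> float:
    """Estimate the token count for a given number of English words."""
    return words / WORDS_PER_TOKEN


def generation_seconds(tokens: float, tokens_per_sec: float) -> float:
    """Estimate the time to stream `tokens` at a steady rate (ignores time to first token)."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return tokens / tokens_per_sec


if __name__ == "__main__":
    tokens = words_to_tokens(750)  # hypothetical long answer
    for rate in (30, 200):
        secs = generation_seconds(tokens, rate)
        print(f"{tokens:.0f} tokens at {rate} tokens/sec: {secs:.1f} s")
```

Under these assumptions the same answer takes about half a minute at 30 tokens/sec and about five seconds at 200 tokens/sec.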

Throughput depends on both the model (smaller and quantized models are generally faster) and the inference provider's hardware. The same open-weight model can run at 40 tokens/sec on one provider and 400 tokens/sec on a provider using specialized accelerators.
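To compare providers yourself, one possible approach is to record when each token arrives in a streamed response and divide the token count by the elapsed time. The sketch below assumes you already have a list of arrival timestamps in seconds; `decode_throughput` is a hypothetical helper, and excluding the wait for the first token is a choice made for this example. Published benchmarks may define the measurement differently.

```python
def decode_throughput(arrival_times: list[float]) -> float:
    """Tokens per second between the first and last token of one response.

    `arrival_times` holds one timestamp (in seconds) per token, in arrival order.
    The wait before the first token is deliberately excluded, so this measures
    generation speed only, not time to first token.
    """
    if len(arrival_times) < 2:
        raise ValueError("need at least two token timestamps")
    elapsed = arrival_times[-1] - arrival_times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    # N timestamps span N - 1 inter-token intervals.
    return (len(arrival_times) - 1) / elapsed


if __name__ == "__main__":
    # Hypothetical stream: five tokens arriving 0.25 s apart.
    stamps = [1.0, 1.25, 1.5, 1.75, 2.0]
    print(f"{decode_throughput(stamps):.1f} tokens/sec")
```

In practice the timestamps could come from calling `time.perf_counter()` each time a streamed chunk arrives. A chunk may carry more than one token, so count tokens rather than chunks.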

Throughput matters most for long outputs: code generation, long-form writing, and agent loops that read their own output. For short answers, latency (time to first token) dominates perceived speed instead.
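A simple model makes this trade-off concrete: total response time is approximately the time to first token plus the output length divided by throughput. The sketch below uses that approximation with hypothetical numbers (0.5 s to first token, 50 tokens/sec) to show how the balance shifts between a short and a long answer. It is an illustration, not a measurement of any real provider.

```python
def total_seconds(ttft: float, output_tokens: int, tokens_per_sec: float) -> float:
    """Approximate response time: time to first token plus steady-state generation."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return ttft + output_tokens / tokens_per_sec


def ttft_share(ttft: float, output_tokens: int, tokens_per_sec: float) -> float:
    """Fraction of the total response time spent waiting for the first token."""
    return ttft / total_seconds(ttft, output_tokens, tokens_per_sec)


if __name__ == "__main__":
    # Hypothetical provider: 0.5 s time to first token, 50 tokens/sec.
    for label, n_tokens in (("short answer", 20), ("long answer", 2000)):
        total = total_seconds(0.5, n_tokens, 50)
        share = ttft_share(0.5, n_tokens, 50)
        print(f"{label}: {total:.1f} s total, {share:.0%} spent waiting for the first token")
```

With these assumed numbers, the wait for the first token is more than half of the short answer's total time but about 1% of the long answer's, which is why throughput matters more as outputs grow.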