Understanding LLM Latency and Throughput

Latency and throughput are the two key performance metrics for LLMs—but they measure different things and matter in different situations.

Latency: Time to First Token

What it measures: How long until the model starts responding.

Latency includes network time, queue time, and the time the model spends processing the prompt before it generates the first token. It's usually measured in milliseconds.
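
As a rough illustration of how time to first token could be measured on the client side, the sketch below starts a timer, waits for the first item from a token stream, and reports the elapsed milliseconds. The names time_to_first_token_ms and fake_stream are hypothetical; fake_stream only simulates a delay and stands in for whatever streaming client is actually in use.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_token_ms(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return milliseconds from this call until the stream yields its first token."""
    start = clock()
    for _token in stream:
        # Stop timing as soon as the first token arrives.
        return (clock() - start) * 1000.0
    raise ValueError("stream produced no tokens")


def fake_stream(first_token_delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming response (not a real API)."""
    # Simulated network + queue + prompt-processing time.
    time.sleep(first_token_delay_s)
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    ttft = time_to_first_token_ms(fake_stream(0.05))
    print(f"Time to first token: {ttft:.0f} ms")
```

With a real client, the timer would need to start just before the request is sent so that network and queue time are included in the measurement.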

Throughput: Tokens per Second

What it measures: How fast the model generates output once it starts.

Throughput determines how quickly a response completes once generation has begun. It's measured in tokens per second. For streaming responses, it's how fast text appears.
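
To see how the two metrics combine, a simple back-of-the-envelope model treats total response time as the time to first token plus the output length divided by throughput. The sketch below assumes a constant generation rate, which real systems only approximate; the function names and the example numbers are made up for illustration.

```python
def tokens_per_second(output_tokens: int, generation_seconds: float) -> float:
    """Throughput: tokens generated divided by the time spent generating them."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return output_tokens / generation_seconds


def estimated_total_seconds(
    ttft_ms: float, output_tokens: int, throughput_tps: float
) -> float:
    """Rough total response time: wait for the first token, then stream the rest."""
    if throughput_tps <= 0:
        raise ValueError("throughput_tps must be positive")
    return ttft_ms / 1000.0 + output_tokens / throughput_tps


if __name__ == "__main__":
    # Illustrative numbers only: 500 tokens generated in 10 seconds.
    tps = tokens_per_second(500, 10.0)
    total = estimated_total_seconds(ttft_ms=400, output_tokens=500, throughput_tps=tps)
    print(f"{tps:.0f} tokens/s, about {total:.1f} s in total")
```

In this example the 400 ms wait is a small part of the total, so throughput dominates. For a short reply of a few dozen tokens the balance shifts and latency matters more.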

The same model can show different latency and throughput numbers because of:

Provider infrastructure: Different providers run the same model on different hardware.
Load: Busy servers mean higher latency and lower throughput.
Request size: Longer prompts take longer to process.
Time of day: Peak hours see more congestion.

This is why we show metrics by provider—the same model performs differently depending on where it's hosted.
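
Because load and time of day shift the numbers from one request to the next, a single measurement says little. One reasonable approach is to collect several samples per provider and compare a typical value (median) with a tail value (95th percentile). The sketch below does this with Python's statistics module; the provider names and latency samples are invented for illustration and are not real benchmark data.

```python
from statistics import median, quantiles

# Made-up time-to-first-token samples in milliseconds (illustration only).
samples_ms = {
    "provider_a": [310, 295, 330, 900, 305, 320],
    "provider_b": [450, 460, 440, 455, 470, 445],
}


def summarize(values: list[float]) -> dict[str, float]:
    """Return the median and an estimate of the 95th percentile."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(values, n=20, method="inclusive")[-1]
    return {"median": median(values), "p95": p95}


if __name__ == "__main__":
    for provider, values in samples_ms.items():
        stats = summarize(values)
        print(f"{provider}: median {stats['median']:.0f} ms, p95 {stats['p95']:.0f} ms")
```

In this invented data, provider_a is faster on a typical request but has one slow outlier that pushes its tail latency well above provider_b's, which is the kind of difference a single sample would hide.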