Latency and throughput are the two key performance metrics for LLMs—but they measure different things and matter in different situations.
Latency: Time to First Token
What it measures: How long until the model starts responding.
Latency includes network time, queue time, and the model's processing time before generating the first token. It's measured in milliseconds.
- Sub-100ms: Feels instant. Ideal for autocomplete.
- 100-300ms: Responsive. Good for chat.
- 300-500ms: Noticeable delay. Acceptable for complex queries.
- 500ms+: Feels slow. Only acceptable for batch processing.
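As a rough illustration, time to first token can be measured by timing how long a streaming request takes to yield its first token, then mapping the result onto the bands above. This is a minimal sketch, not any provider's API: `send_request` is a hypothetical callable that starts a request and returns an iterable of tokens, the tier names are invented labels, and treating each boundary value (100, 300, 500) as belonging to the slower band is an assumption.

```python
import time
from typing import Callable, Iterable


def measure_ttft_ms(
    send_request: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return milliseconds from sending the request until the first token arrives.

    `send_request` is a hypothetical stand-in for whatever client call
    starts a streaming response; `clock` is injectable so it can be faked.
    """
    start = clock()
    for _token in send_request():
        # Stop timing as soon as the first token shows up.
        return (clock() - start) * 1000.0
    raise ValueError("stream ended without producing a token")


def latency_tier(ttft_ms: float) -> str:
    """Map a time-to-first-token value onto the bands listed above."""
    if ttft_ms < 100:
        return "instant"
    if ttft_ms < 300:
        return "responsive"
    if ttft_ms < 500:
        return "noticeable"
    return "slow"
```

Because the timer starts on the client, the measurement includes network and queue time as well as the model's own processing, which matches the definition of latency used here.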
Throughput: Tokens Per Second
What it measures: How fast the model generates output once it starts.
Throughput determines how quickly a response completes. For streaming responses, it's how fast text appears.
- 100+ tokens per second (t/s): Faster than reading speed. Feels instant.
- 50-100 t/s: Comfortable reading pace.
- 20-50 t/s: Noticeably slow streaming.
- Under 20 t/s: Painfully slow for streaming.
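One way to compute throughput, sketched here under the assumption that the client records a timestamp (in seconds) for each token as it arrives, is to divide the number of tokens generated after the first by the time elapsed since the first token. Excluding the wait before the first token keeps the figure separate from latency. The function name and the input format are illustrative choices.

```python
from typing import Sequence


def tokens_per_second(arrival_times: Sequence[float]) -> float:
    """Estimate generation speed from per-token arrival timestamps (seconds).

    The clock starts at the first token, so time to first token is excluded.
    """
    if len(arrival_times) < 2:
        raise ValueError("need at least two tokens to measure throughput")
    elapsed = arrival_times[-1] - arrival_times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    # N timestamps span N - 1 inter-token intervals.
    return (len(arrival_times) - 1) / elapsed
```

For example, three tokens arriving at 0.0 s, 0.5 s, and 1.0 s give `tokens_per_second([0.0, 0.5, 1.0]) == 2.0`.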
Which Matters When
Latency-Critical Use Cases
- Autocomplete (IDE, search)
- Real-time chat
- Voice assistants
Throughput-Critical Use Cases
- Streaming long responses
- Batch processing (total time = tokens / throughput)
- Document generation
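The batch formula above (total time = tokens / throughput) can be extended with the initial wait to show why each metric dominates in different cases. The sketch below assumes a constant generation speed and a single request; the function and parameter names are illustrative.

```python
def estimated_completion_seconds(
    output_tokens: int,
    tokens_per_second: float,
    ttft_ms: float = 0.0,
) -> float:
    """Estimate total response time: initial wait plus generation time.

    Assumes throughput stays constant for the whole response.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_ms / 1000.0 + output_tokens / tokens_per_second


# A 1000-token document at 50 t/s: generation time dominates.
long_reply = estimated_completion_seconds(1000, 50.0, ttft_ms=500.0)  # 20.5

# A 25-token reply at 50 t/s: the 500 ms wait is half the total.
short_reply = estimated_completion_seconds(25, 50.0, ttft_ms=500.0)  # 1.0
```

With these example numbers, the first-token wait is about 2% of the long response but 50% of the short one, which is why short interactive replies are latency-critical and long generations are throughput-critical.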
Why Metrics Vary
The same model can show different numbers because of:
- Provider infrastructure: Different providers run the same model on different hardware.
- Load: Busy servers mean higher latency and lower throughput.
- Request size: Longer prompts take longer to process before the first token appears, which raises latency.
- Time of day: Peak hours see more congestion.
This is why we show metrics by provider—the same model performs differently depending on where it's hosted.
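To make the per-provider comparison concrete, repeated measurements can be grouped by provider and summarized with a median, which is less sensitive to occasional load spikes than a mean. This is a sketch only: the provider names and numbers are invented for illustration, and the choice of median is an assumption rather than a description of how the metrics shown here are computed.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def median_ttft_by_provider(
    samples: Iterable[Tuple[str, float]],
) -> Dict[str, float]:
    """Group (provider, ttft_ms) samples and return the median per provider."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for provider, ttft_ms in samples:
        grouped[provider].append(ttft_ms)
    return {provider: median(values) for provider, values in grouped.items()}


# Invented sample data: the same model measured on two hypothetical providers.
samples = [
    ("provider-a", 120.0),
    ("provider-a", 180.0),
    ("provider-a", 150.0),
    ("provider-b", 300.0),
    ("provider-b", 260.0),
]
summary = median_ttft_by_provider(samples)
# summary == {"provider-a": 150.0, "provider-b": 280.0}
```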