
Understanding LLM Latency and Throughput

What these metrics mean and how they affect user experience.

Latency and throughput are the two key performance metrics for LLMs, but they measure different things and matter in different situations.

Latency: Time to First Token

What it measures: How long until the model starts responding.

Latency here means time to first token (TTFT). It includes network time, queue time, and the time the model spends processing the prompt before it generates the first token. It's measured in milliseconds.

  • Sub-100ms: Feels instant. Ideal for autocomplete.
  • 100-300ms: Responsive. Good for chat.
  • 300-500ms: Noticeable delay. Acceptable for complex queries.
  • 500ms+: Feels slow. Only acceptable for batch processing.

Throughput: Tokens Per Second

What it measures: How fast the model generates output once it starts.

Throughput determines how quickly a response completes once generation has started. For streaming responses, it sets how fast text appears on screen. The ranges below use t/s for tokens per second.

  • 100+ t/s: Faster than reading speed. Feels instant.
  • 50-100 t/s: Comfortable reading pace.
  • 20-50 t/s: Noticeably slow streaming.
  • Under 20 t/s: Painfully slow for streaming.
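Both metrics can be read off a single streamed response. The sketch below is a minimal illustration, not a definitive benchmark harness: `measure_stream` and its `clock` parameter are hypothetical names, and it assumes the response arrives as any Python iterable that yields one token at a time, with the timer started just before the request is sent.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_ms, tokens_per_second, token_count) for one streamed response.

    Assumption: `tokens` is consumed right after the request is sent, so the
    first clock reading approximates the request start time.
    """
    start = clock()   # approximate moment the request was sent
    first = None      # arrival time of the first token
    last = start      # arrival time of the most recent token
    count = 0

    for _ in tokens:
        last = clock()
        if first is None:
            first = last
        count += 1

    if first is None:
        raise ValueError("stream produced no tokens")

    # Latency: time from request start to the first token, in milliseconds.
    ttft_ms = (first - start) * 1000.0

    # Throughput: tokens generated after the first one, divided by the time
    # between the first and last token. Not measurable from a single token,
    # so 0.0 is returned in that case.
    generation_seconds = last - first
    tokens_per_second = (count - 1) / generation_seconds if generation_seconds > 0 else 0.0

    return ttft_ms, tokens_per_second, count
```

Passing a custom `clock` makes the function easy to check with fixed timestamps. In real use the default `time.perf_counter` is enough, and the measured TTFT will include network and queue time along with the model's own processing.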

Which Matters When

Latency-Critical Use Cases

  • Autocomplete (IDE, search)
  • Real-time chat
  • Voice assistants

Throughput-Critical Use Cases

  • Streaming long responses
  • Batch processing (generation time ≈ output tokens / throughput)
  • Document generation
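To see why each metric dominates in different cases, it helps to add the two together. The sketch below extends the batch-processing formula with the time to first token. The function name `estimate_total_seconds` is hypothetical, and the model is a simplification that assumes throughput stays constant for the whole response.

```python
def estimate_total_seconds(ttft_ms: float, output_tokens: int, tokens_per_second: float) -> float:
    """Rough end-to-end response time: time to first token plus generation time.

    Simplifying assumption: throughput is constant while the response streams.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_ms / 1000.0 + output_tokens / tokens_per_second


# Short autocomplete-style response: 300 ms TTFT, 20 tokens at 50 t/s.
# Total is about 0.7 s, and latency is a large share of it.
short_response = estimate_total_seconds(300, 20, 50)

# Long document: 300 ms TTFT, 1000 tokens at 50 t/s.
# Total is about 20.3 s, almost all of it generation time.
long_response = estimate_total_seconds(300, 1000, 50)
```

With the same 300 ms latency and 50 t/s throughput, latency is nearly half of the short response's total time but under 2% of the long one's, which is why short interactive requests are latency-critical and long outputs are throughput-critical.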

Why Metrics Vary

The same model can show different numbers because of:

  • Provider infrastructure: Different providers run the same model on different hardware.
  • Load: Busy servers mean higher latency and lower throughput.
  • Request size: Longer prompts take longer to process, which raises time to first token.
  • Time of day: Peak hours see more congestion.

This is why we show metrics by provider: the same model performs differently depending on where it's hosted.
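Because load and time of day shift individual measurements, a per-provider figure is more useful as a summary of many samples than as a single reading. The sketch below shows one way to group measurements by provider and take the median. It is an illustration only: `summarize_by_provider`, the provider names, and the sample numbers are all made up, and it does not describe how the metrics on this site are actually computed.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

# One measurement: (provider name, time to first token in ms, tokens per second).
Sample = Tuple[str, float, float]


def summarize_by_provider(samples: Iterable[Sample]) -> Dict[str, Dict[str, float]]:
    """Group samples by provider and report the median of each metric."""
    grouped = defaultdict(list)
    for provider, ttft_ms, tokens_per_second in samples:
        grouped[provider].append((ttft_ms, tokens_per_second))

    return {
        provider: {
            "median_ttft_ms": median(ttft for ttft, _ in rows),
            "median_tps": median(tps for _, tps in rows),
            "samples": len(rows),
        }
        for provider, rows in grouped.items()
    }


# Made-up measurements of the same model on two hypothetical providers.
example_samples = [
    ("provider_a", 120, 80),
    ("provider_a", 180, 60),
    ("provider_a", 150, 70),
    ("provider_b", 400, 110),
    ("provider_b", 300, 90),
]
summary = summarize_by_provider(example_samples)
```

The median is used here rather than the mean so that a single congested request does not distort the summary. A percentile such as p95 would be a reasonable addition when worst-case latency matters.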