modelgrep

Lowest-Latency LLMs

Quick answer · Updated June 2026

Llama Guard 4 12B has the lowest latency of the models ranked here, with a median time to first token of about 118ms. Phi 4 Mini Instruct (140ms) and Llama 3.1 8B Instruct (141ms) round out the top three.

Llama Guard 4 12B: 118ms latency · 18 t/s speed · $0.180/M input · 164K context

AI models ranked by median (p50) time to first token (TTFT), lowest first. Low TTFT matters most for real-time and interactive use cases, where users wait for the response to start.

  1. llama-guard-4-12b · 118ms latency
    JSON, Vision · $0.180/M input · 18 t/s · 164K ctx
  2. phi-4-mini-instruct · 140ms latency
    JSON · 8.4 intel · $0.080/M input · 277 t/s
  3. llama-3.1-8b-instruct · 141ms latency
    Tools, JSON · 11.8 intel · $0.020/M input · 147 t/s
  4. l3-lunaris-8b · 147ms latency
    JSON · $0.040/M input · 70 t/s · 8K ctx
  5. rnj-1-instruct · 176ms latency
    Tools, JSON · $0.150/M input · 121 t/s · 33K ctx
  6. gpt-oss-120b:free · 181ms latency
    Reasoning, Tools · 24.5 intel · Free · 450 t/s
  7. gpt-oss-120b · 181ms latency
    Reasoning, Tools, JSON · 24.5 intel · $0.039/M input · 450 t/s
  8. codestral-2508 · 182ms latency
    Tools, JSON · $0.300/M input · 78 t/s · 256K ctx
  9. intellect-3 · 186ms latency
    Reasoning, Tools, JSON · 22.2 intel · $0.200/M input · 85 t/s
  10. glm-5 · 189ms latency
    Reasoning, Tools, JSON · 49.8 intel · $0.600/M input · 112 t/s
  11. granite-4.1-8b · 197ms latency
    Tools, JSON · 12.4 intel · $0.050/M input · 73 t/s
  12. llama-3.2-11b-vision-instruct · 203ms latency
    JSON, Vision · 8.7 intel · $0.345/M input · 36 t/s
  13. llama-3.3-70b-instruct:free · 205ms latency
    Tools · 14.5 intel · Free · 98 t/s
  14. llama-3.3-70b-instruct · 205ms latency
    Tools, JSON · 14.5 intel · $0.100/M input · 98 t/s
  15. lfm-2-24b-a2b · 215ms latency
    10.5 intel · $0.030/M input · 52 t/s
  16. command-r-08-2024 · 224ms latency
    Tools, JSON · $0.150/M input · 55 t/s · 128K ctx
  17. kimi-k2.5 · 235ms latency
    Reasoning, Tools, JSON (+1 more) · 37.3 intel · $0.375/M input · 117 t/s
  18. gpt-oss-safeguard-20b · 235ms latency
    Reasoning, Tools, JSON · $0.075/M input · 530 t/s · 131K ctx
  19. gpt-oss-20b:free · 235ms latency
    Reasoning, Tools, JSON · 24.5 intel · Free · 348 t/s
  20. gpt-oss-20b · 235ms latency
    Reasoning, Tools, JSON · 24.5 intel · $0.029/M input · 348 t/s
  21. command-r7b-12-2024 · 239ms latency
    JSON · $0.037/M input · 55 t/s · 128K ctx
  22. ministral-14b-2512 · 240ms latency
    Tools, JSON, Vision · 16.0 intel · $0.200/M input · 55 t/s
  23. phi-4 · 240ms latency
    JSON · 10.4 intel · $0.065/M input · 44 t/s
  24. qwen3-30b-a3b-instruct-2507 · 242ms latency
    Tools, JSON · 15.0 intel · $0.048/M input · 69 t/s
  25. command-r-plus-08-2024 · 250ms latency
    Tools, JSON · $2.50/M input · 12 t/s · 128K ctx
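Latency alone does not determine how long a complete reply takes. A rough Python sketch, assuming the listed t/s figure is a steady output throughput after the first token and ignoring network variance, estimates total time as TTFT + output_tokens / tokens_per_s. The `ModelStats` class and both helper functions are hypothetical; the numbers are copied from three entries in the list above, and only input cost is computed because output pricing is not listed on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    name: str
    ttft_ms: float          # median time to first token, from the list above
    tokens_per_s: float     # listed speed, assumed to be steady output throughput
    input_usd_per_m: float  # listed input price per million tokens


def estimated_response_s(m: ModelStats, output_tokens: int) -> float:
    """Rough wall-clock estimate: TTFT plus time to stream the remaining output."""
    return m.ttft_ms / 1000 + output_tokens / m.tokens_per_s


def input_cost_usd(m: ModelStats, input_tokens: int) -> float:
    """Input-token cost only; output pricing is not shown in the ranking."""
    return input_tokens / 1_000_000 * m.input_usd_per_m


models = [
    ModelStats("llama-guard-4-12b", 118, 18, 0.180),
    ModelStats("gpt-oss-120b", 181, 450, 0.039),
    ModelStats("llama-3.1-8b-instruct", 141, 147, 0.020),
]

for m in models:
    print(f"{m.name}: ~{estimated_response_s(m, 500):.1f}s for 500 output tokens, "
          f"${input_cost_usd(m, 2_000):.5f} for 2K input tokens")
```

With these figures the sketch estimates roughly 27.9s for a 500-token reply from llama-guard-4-12b versus about 1.3s from gpt-oss-120b, so the model with the lowest TTFT is not necessarily the fastest for long outputs.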

Frequently asked

Which LLM has the lowest latency?

Among the models ranked here, Llama Guard 4 12B, at about 118ms median time to first token. Phi 4 Mini Instruct (140ms) and Llama 3.1 8B Instruct (141ms) come next.

What's a good alternative to Llama Guard 4 12B?

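Phi 4 Mini Instruct (140ms) is the closest alternative on this metric, followed by Llama 3.1 8B Instruct (141ms). Both also stream output much faster than Llama Guard 4 12B (277 t/s and 147 t/s versus 18 t/s). See the full ranking above for price, throughput, and capability tradeoffs.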
