modelgrep

Small & Fast LLMs

Quick answer · Updated June 2026

The fastest small LLM in this ranking is gpt-oss-safeguard-20b, at 530 tokens/sec and $0.075 per million input tokens. Like the rest of the efficient tier, it trades a few points of raw intelligence for speed and cost, which is the right call for high-volume, latency-sensitive work. gpt-oss-120b (free) and gpt-oss-120b, both at 450 t/s, round out the top three.

Speed: 530 t/s
Input: $0.075/M
Context: 131K

Compact, efficient models — the small/mini/flash/haiku tier — ranked by output speed. These trade a little raw intelligence for low cost and high throughput, which is the right tradeoff for chat, classification, extraction and other high-volume work.

  1. gpt-oss-safeguard-20b
    Reasoning, Tools, JSON · $0.075/M · 235ms ttft · 131K ctx
    Speed: 530 t/s
  2. gpt-oss-120b:free
    Reasoning, Tools · 24.5 intel · Free · 181ms ttft
    Speed: 450 t/s
  3. gpt-oss-120b
    Reasoning, Tools, JSON · 24.5 intel · $0.039/M · 181ms ttft
    Speed: 450 t/s
  4. gpt-oss-20b:free
    Reasoning, Tools, JSON · 24.5 intel · Free · 235ms ttft
    Speed: 348 t/s
  5. gpt-oss-20b
    Reasoning, Tools, JSON · 24.5 intel · $0.029/M · 235ms ttft
    Speed: 348 t/s
  6. mercury-2
    Reasoning, Tools, JSON · 32.8 intel · $0.250/M · 303ms ttft
    Speed: 328 t/s
  7. qwen3-32b
    Reasoning, Tools, JSON · 16.5 intel · $0.080/M · 292ms ttft
    Speed: 303 t/s
  8. phi-4-mini-instruct
    JSON · 8.4 intel · $0.080/M · 140ms ttft
    Speed: 277 t/s
  9. nemotron-3-super-120b-a12b:free
    Reasoning, Tools, JSON · 36.0 intel · Free · 1.2s ttft
    Speed: 240 t/s
  10. nemotron-3-super-120b-a12b
    Reasoning, Tools, JSON · 36.0 intel · $0.090/M · 1.2s ttft
    Speed: 240 t/s
  11. gemini-2.5-flash-lite-preview-09-2025
    Reasoning, Tools, JSON, +2 more · 19.4 intel · $0.100/M · 397ms ttft
    Speed: 202 t/s
  12. nemotron-3-nano-omni-30b-a3b-reasoning:free
    Reasoning, Tools, Vision, +1 more · 21.4 intel · Free · 436ms ttft
    Speed: 194 t/s
  13. gemini-2.5-flash-image
    JSON, Vision, Image out · $0.300/M · 1.7s ttft · 33K ctx
    Speed: 189 t/s
  14. qwen3.5-35b-a3b
    Reasoning, Tools, JSON, +1 more · 30.7 intel · $0.140/M · 400ms ttft
    Speed: 174 t/s
  15. qwen3.6-35b-a3b
    Reasoning, Tools, JSON, +1 more · 31.5 intel · $0.150/M · 310ms ttft
    Speed: 173 t/s
  16. qwen3-next-80b-a3b-thinking
    Reasoning, Tools, JSON · 26.7 intel · $0.098/M · 352ms ttft
    Speed: 168 t/s
  17. gemini-3.5-flash
    Reasoning, Tools, JSON, +2 more · 43.3 intel · $1.50/M · 1.7s ttft
    Speed: 164 t/s
  18. ling-2.6-flash
    Tools, JSON · 26.2 intel · $0.010/M · 871ms ttft
    Speed: 153 t/s
  19. o4-mini-high
    Reasoning, Tools, JSON, +1 more · $1.10/M · 3.5s ttft · 200K ctx
    Speed: 150 t/s
  20. llama-3.1-8b-instruct
    Tools, JSON · 11.8 intel · $0.020/M · 141ms ttft
    Speed: 147 t/s
  21. gemini-3.1-flash-image-preview
    Reasoning, JSON, Vision, +1 more · $0.500/M · 9.8s ttft · 131K ctx
    Speed: 143 t/s
  22. nemotron-3-nano-30b-a3b:free
    Reasoning, Tools · 24.3 intel · Free · 434ms ttft
    Speed: 141 t/s
  23. nemotron-3-nano-30b-a3b
    Reasoning, Tools, JSON · 24.3 intel · $0.050/M · 434ms ttft
    Speed: 141 t/s
  24. qwen3-30b-a3b-thinking-2507
    Reasoning, Tools, JSON · 22.4 intel · $0.080/M · 499ms ttft
    Speed: 134 t/s
  25. minimax-m2
    Reasoning, Tools, JSON · 36.1 intel · $0.255/M · 362ms ttft
    Speed: 134 t/s
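
The ranking orders models by output speed alone, but for latency-sensitive work the time to first token (ttft) matters too. As a rough sketch, response time can be approximated as ttft plus output tokens divided by output speed. This is an assumed simplification, not the site's methodology: it ignores prompt length, hidden reasoning tokens and provider variance. The function names are illustrative, and the figures are copied from three entries in the list above.

```python
def estimated_latency_s(ttft_ms: float, tokens_per_s: float, output_tokens: int) -> float:
    """Assumed model: wait for the first token, then stream at a constant rate."""
    return ttft_ms / 1000.0 + output_tokens / tokens_per_s


# (name, ttft in ms, output speed in t/s), taken from the ranking
models = [
    ("gpt-oss-safeguard-20b", 235, 530),
    ("phi-4-mini-instruct", 140, 277),
    ("nemotron-3-super-120b-a12b", 1200, 240),
]


def rank_by_latency(models, output_tokens):
    """Sort models by estimated time to produce `output_tokens` tokens."""
    return sorted(
        models,
        key=lambda m: estimated_latency_s(m[1], m[2], output_tokens),
    )


if __name__ == "__main__":
    for n in (20, 500):
        order = [name for name, _, _ in rank_by_latency(models, n)]
        print(f"{n} output tokens: {order}")
```

Under this estimate, phi-4-mini-instruct comes out ahead for a 20-token reply such as a classification label, because its lower ttft outweighs its slower streaming, while gpt-oss-safeguard-20b leads for a 500-token reply.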

Frequently asked

What is the smallest, fastest LLM?

The fastest small LLM in this ranking is gpt-oss-safeguard-20b, at 530 tokens/sec and $0.075 per million input tokens. It trades a few points of raw intelligence for speed and cost, which is the right call for high-volume, latency-sensitive work. gpt-oss-120b (free) and gpt-oss-120b, both at 450 t/s, round out the top three.

What's a good alternative to gpt-oss-safeguard-20b?

gpt-oss-120b (free) and gpt-oss-120b are the closest alternatives on output speed, both at 450 t/s. See the full ranking above for the tradeoffs.
