Llama Guard 4 12B has the lowest latency of any LLM in this ranking, with a median (p50) time to first token of about 118ms. Phi 4 Mini Instruct (140ms) and Llama 3.1 8B Instruct (141ms) round out the top three.
AI models ranked by time-to-first-token (p50). The most responsive large language models for real-time and interactive use cases.
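As a rough illustration of the metric, the sketch below computes a p50 (median) time-to-first-token from a list of per-request measurements and sorts models by it. The function name `p50_ttft_ms`, the model names, and the sample values are all hypothetical and are not taken from this ranking; how the ranking itself collects and aggregates its measurements is not stated here.

```python
from statistics import median


def p50_ttft_ms(samples_ms):
    """Return the median (p50) of time-to-first-token samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    return median(samples_ms)


# Hypothetical samples for illustration only, not measurements from this ranking.
samples = {
    "model-a": [110.0, 118.0, 125.0, 117.0, 300.0],
    "model-b": [139.0, 140.0, 150.0, 138.0, 141.0],
}

# Lower p50 means a faster typical first token, so sort ascending.
ranking = sorted(samples, key=lambda name: p50_ttft_ms(samples[name]))

for name in ranking:
    print(f"{name}: {p50_ttft_ms(samples[name]):.0f}ms")
```

Because the median is used, the single slow 300ms request for `model-a` does not change its p50, which is one reason a p50 figure describes the typical request rather than the worst case.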
Phi 4 Mini Instruct (140ms) is the closest alternative on this metric, followed by Llama 3.1 8B Instruct (141ms). See the full ranking above for the tradeoffs.