Llama 3 8B Lunaris has the lowest latency of any Sao10K model, with a median (p50) time to first token of about 126 ms. Llama 3.1 Euryale 70B v2.2 (394 ms) and Llama 3.1 70B Hanami x1 (641 ms) round out the top three.
AI models ranked by median (p50) time to first token: the most responsive large language models for real-time and interactive use cases.
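As a rough illustration of what a p50 ranking involves, the sketch below takes the median of several time-to-first-token measurements per model and sorts the models from fastest to slowest. The function name `rank_by_p50_ttft`, the model labels, and the sample values are all hypothetical and are not modelgrep data or modelgrep's actual method.

```python
from statistics import median


def rank_by_p50_ttft(samples_ms):
    """Return (model, p50_ms) pairs sorted from lowest to highest median TTFT.

    samples_ms maps a model name to a list of time-to-first-token
    measurements in milliseconds. Models with no samples are skipped.
    """
    p50 = {model: median(values) for model, values in samples_ms.items() if values}
    return sorted(p50.items(), key=lambda item: item[1])


# Hypothetical measurements in milliseconds (illustrative only).
samples = {
    "model-a": [410, 380, 395],
    "model-b": [120, 131, 126],
    "model-c": [650, 641, 600],
}

ranking = rank_by_p50_ttft(samples)
for model, p50_ms in ranking:
    print(f"{model}: {p50_ms} ms")
```

The median is used rather than the mean so that a single slow request does not dominate a model's score.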
Llama 3.1 Euryale 70B v2.2 (394ms) is the closest alternative on this metric, followed by Llama 3.1 70B Hanami x1 (641ms). See the full ranking above for the tradeoffs.
modelgrep tracks 4 Sao10K models with live benchmarks, speed, latency, and per-provider pricing. All 4 qualify for this ranking.