Mixture of experts (MoE)

An architecture that splits a model into many specialized "expert" sub-networks and activates only a few per token — large total size, small active cost.

A mixture-of-experts model contains many expert sub-networks but routes each token to just a handful of them. A model can have 235B total parameters yet activate only ~22B per token — so it carries the knowledge of a huge model while costing closer to a small one to run.
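As a rough illustration of the routing described above, here is a minimal pure-Python sketch of top-k expert selection. The names `route_token` and `moe_layer`, the toy experts, and the router scores are all hypothetical. Real models differ in detail, for example in whether the softmax is applied before or after picking the top k.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Assumption: weights are renormalized over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts; the rest never run."""
    out = 0.0
    for idx, w in route_token(router_logits, k):
        out += w * experts[idx](token)
    return out

# Hypothetical toy setup: 8 "experts", each of which just scales its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

chosen = route_token(router_logits, k=2)
output = moe_layer(1.0, experts, router_logits, k=2)
```

Only two of the eight toy experts are evaluated for this token, which is the source of the "small active cost" even though all eight exist in the model.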

This is why modern open-weight leaders (DeepSeek, Qwen, Llama 4, Mixtral) are nearly all MoE: it decouples capability from inference cost. The tradeoff is memory — you still have to load all the experts — and routing complexity.
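The memory side of that tradeoff can be estimated with simple arithmetic. The sketch below assumes memory is dominated by the weights and ignores the KV cache, activations, and runtime overhead; `weight_memory_gb` is a hypothetical helper, and the bytes-per-parameter figures are the usual ones for 16-bit and 4-bit storage.

```python
def weight_memory_gb(params_billions, bytes_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billions * bytes_per_param

# A 235B-total MoE must hold every expert in memory...
moe_16bit = weight_memory_gb(235, 2)    # 16-bit weights
moe_4bit = weight_memory_gb(235, 0.5)   # 4-bit quantized weights

# ...while a dense model matching its ~22B active size needs far less.
dense_16bit = weight_memory_gb(22, 2)
```

Under these assumptions the MoE needs roughly ten times the weight memory of a dense model with the same per-token compute.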

When you see a model quoted as "A active / B total" (e.g. 37B active / 671B total), that's MoE. Price and speed roughly track the active count, since only those parameters are computed for each token; capability tends to track the total.
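To make the notation concrete, a small sketch can pull the two numbers out of such a label and compute what fraction of the model runs per token. `parse_moe_size` is a hypothetical helper, and the regular expression assumes the exact "NB active / MB total" wording used here.

```python
import re

def parse_moe_size(label):
    """Parse 'A B active / B B total' style labels into (active, total, fraction)."""
    m = re.search(r"([\d.]+)B active\s*/\s*([\d.]+)B total", label)
    if m is None:
        raise ValueError(f"not an 'active / total' label: {label!r}")
    active = float(m.group(1))
    total = float(m.group(2))
    return active, total, active / total

active, total, fraction = parse_moe_size("37B active / 671B total")
# fraction is about 0.055: roughly 5.5% of the parameters are used per token.
```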
