The terms that matter when you compare AI models — defined in plain English, with the data to back them up.
The maximum amount of text (measured in tokens) an LLM can consider at once — its working memory for a single request.
The rate at which an LLM generates output, measured in tokens per second — the main determinant of how fast responses feel once they start.
How long an LLM takes to begin responding after receiving a request — the metric that determines how responsive a model feels.
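The two speed metrics above combine into a rough estimate of how long a full response takes. The sketch below is a simplification that assumes a constant generation rate; the function name and the example numbers are illustrative, not measurements of any real model.

```python
def estimated_response_seconds(ttft_s: float, output_tokens: int,
                               tokens_per_second: float) -> float:
    """Rough end-to-end time: wait for the first token, then stream the rest.

    Assumes a constant output rate, which real deployments only approximate.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_s + output_tokens / tokens_per_second


# Hypothetical models answering with 500 output tokens.
quick_start = estimated_response_seconds(0.3, 500, 50)    # starts fast, streams slowly
quick_stream = estimated_response_seconds(1.5, 500, 200)  # starts slowly, streams fast

print(f"quick start:  {quick_start:.1f} s")
print(f"quick stream: {quick_stream:.1f} s")
```

For long answers the output rate dominates the total; for short answers the initial wait does.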
A composite benchmark score (0–100) from Artificial Analysis that combines reasoning, knowledge, math and science evals into one comparable intelligence number.
A graduate-level science benchmark of multiple-choice questions so hard that skilled non-experts with web access score only about 34%, against 25% for random guessing; a standard test of deep reasoning.
A ranking system from competitive chess applied to AI: models battle head-to-head on the same task and humans pick the winner, producing a relative skill score.
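A minimal sketch of the classic chess Elo update applied to one head-to-head comparison follows. The K-factor of 32 and the starting ratings are assumptions for illustration; a given AI leaderboard may fit its ratings with a different statistical method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float,
           k: float = 32.0) -> tuple[float, float]:
    """Return new ratings after one battle.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k (assumed 32 here) controls how far one result moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models; a human voter prefers model A.
new_a, new_b = update(1200.0, 1200.0, score_a=1.0)
print(new_a, new_b)  # 1216.0 1184.0
```

An upset moves ratings further than an expected result, which is why the score is relative: it only describes how a model fares against the others in the pool.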
A pricing and latency optimization where a provider reuses computation for repeated prompt prefixes, often cutting input costs by 50–90%.
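The saving from a cached prefix can be estimated with simple arithmetic. This sketch assumes a single flat discount on cached input tokens and ignores any cache-write surcharge a provider might add; the price and discount shown are hypothetical.

```python
def input_cost_usd(prompt_tokens: int, cached_tokens: int,
                   price_per_million: float, cached_discount: float) -> float:
    """Input cost when part of the prompt is served from a prefix cache.

    cached_discount is the fraction taken off the normal input price
    for cached tokens (0.9 means cached tokens cost 10% of normal).
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    if not 0 <= cached_discount <= 1:
        raise ValueError("cached_discount must be between 0 and 1")
    fresh_tokens = prompt_tokens - cached_tokens
    cached_price = price_per_million * (1 - cached_discount)
    total = fresh_tokens * price_per_million + cached_tokens * cached_price
    return total / 1_000_000


# Hypothetical: 10,000-token prompt, $3 per million input tokens, 90% discount.
no_cache = input_cost_usd(10_000, 0, 3.0, 0.9)
with_cache = input_cost_usd(10_000, 8_000, 3.0, 0.9)
print(f"uncached: ${no_cache:.4f}  cached prefix: ${with_cache:.4f}")
```

The benefit scales with how much of each request is a repeated prefix, such as a long system prompt or a shared document.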
Compressing a model's weights to lower numeric precision (e.g. FP8, INT4) so it runs faster and cheaper — usually with a small quality cost.
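To make the precision trade-off concrete, here is a toy sketch of symmetric INT4 quantization over a short list of weights. It assumes one scale for the whole list; production schemes typically use per-group scales and more careful rounding, and the sample weights are made up.

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in the signed 4-bit range [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale so the largest weight lands on +/-7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the 4-bit integers."""
    return [q * scale for q in quantized]


weights = [0.7, -0.3, 0.12, 0.0]  # hypothetical weights
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("quantized:", q)
print("max rounding error:", max(errors))
```

Each weight now needs 4 bits instead of 16 or 32, and the rounding error printed at the end is the source of the small quality cost.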
LLMs that "think" before answering, spending extra tokens on hidden chain-of-thought to solve harder problems at the cost of latency and price.
Models whose trained parameters are publicly downloadable — self-hostable, fine-tunable, and servable by any inference provider, unlike closed API-only models.
The trained weights inside a model, counted in billions (B). More parameters generally mean more capability, and more cost and latency to run.
An architecture that splits a model into many specialized "expert" sub-networks and activates only a few per token — large total size, small active cost.
A model capability for emitting structured calls to external tools or APIs — the foundation of agents, retrieval and anything that acts on the world.
A pattern that fetches relevant documents at query time and feeds them into the prompt, so the model answers from your data instead of memory.
The date after which a model has no built-in knowledge — anything more recent must be supplied via tools, search or your prompt.
Most labs ship a model family in roughly three tiers: a small, fast model, a balanced one, and a flagship, so you can match capability to the job and the budget.
A coding benchmark of real GitHub issues: the model must produce a patch that makes the repo's actual test suite pass. Among widely used benchmarks, it is one of the closest to real-world software work.
A model that handles more than text — most commonly accepting image input alongside text, and sometimes audio in or image/audio out.
Continuing training on your own examples to specialize a base model's style, format or domain — distinct from just writing a better prompt.
A sampling setting (typically 0–2) that controls randomness: low values make output near-deterministic and focused, high values make it more varied and creative.
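The effect of this setting is easiest to see on a small example. The sketch below divides made-up logits by the temperature before a softmax, which is the usual formulation; it rejects a temperature of 0, whereas many APIs treat 0 as "always pick the top token".

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into token probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all probability sits on the top-scoring token; at high temperature the three candidates move closer to equally likely.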