modelgrep

LLM glossary

The terms that matter when you compare AI models — defined in plain English, with the data to back them up.

Context window

The maximum amount of text (measured in tokens) an LLM can consider at once — its working memory for a single request.

Tokens per second (throughput)

The rate at which an LLM generates output, measured in tokens per second — the main determinant of how fast responses feel once they start.

Time to first token (latency)

How long an LLM takes to begin responding after receiving a request — the metric that determines how responsive a model feels.
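
Latency and throughput combine into a rough estimate of total response time. The sketch below assumes a simple model (wait for the first token, then stream the rest at a constant rate); the function name and sample numbers are illustrative, not measurements.

```python
def estimated_response_seconds(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Approximate wall-clock time: time to first token plus streaming time."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + output_tokens / tokens_per_s


# Hypothetical model: 0.5 s to first token, 100 tokens/s, 300-token answer.
print(estimated_response_seconds(0.5, 300, 100.0))  # 3.5
```

For short answers the first term dominates; for long answers throughput does.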

Artificial Analysis Intelligence Index

A composite benchmark score (0–100) from Artificial Analysis that combines reasoning, knowledge, math and science evals into one comparable intelligence number.

GPQA (Diamond)

A graduate-level science benchmark of questions so hard that even skilled non-experts with web access score barely above chance — a standard test of deep reasoning.

Elo rating (for LLMs)

A ranking system from competitive chess applied to AI: models battle head-to-head on the same task and humans pick the winner, producing a relative skill score.
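
The classic chess-style Elo update can be sketched as follows. This is the textbook formula with an assumed K-factor of 32; real LLM leaderboards may fit ratings with a different statistical procedure, so treat it as an illustration of the idea rather than any leaderboard's method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b). score_a is 1 for a win, 0.5 tie, 0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models; a human voter prefers model A.
print(update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

An upset (a low-rated model beating a high-rated one) moves both ratings further than an expected result does.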

Prompt caching

A pricing and latency optimization where a provider reuses computation for repeated prompt prefixes, often cutting input costs by 50–90%.
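
A back-of-the-envelope cost comparison shows where the saving comes from. The price and discount below are made-up values, and the sketch ignores any extra fee a provider may charge for writing to the cache.

```python
def input_cost_usd(total_tokens: int, cached_tokens: int,
                   price_per_mtok: float, cached_discount: float) -> float:
    """Input cost when cached tokens are billed at a discounted rate."""
    uncached_tokens = total_tokens - cached_tokens
    cached_price = price_per_mtok * (1.0 - cached_discount)
    return (uncached_tokens * price_per_mtok + cached_tokens * cached_price) / 1_000_000


# Hypothetical: 10,000-token prompt, 8,000 tokens served from cache,
# $3 per million input tokens, 50% discount on cached tokens.
print(input_cost_usd(10_000, 0, 3.0, 0.5))      # no cache hit
print(input_cost_usd(10_000, 8_000, 3.0, 0.5))  # with cache hit
```

The saving scales with the share of the prompt that is a repeated prefix, which is why long, stable system prompts benefit most.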

Quantization

Compressing a model's weights to lower numeric precision (e.g. FP8, INT4) so it runs faster and cheaper — usually with a small quality cost.
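
To make the quality cost concrete, here is a minimal sketch of symmetric 4-bit quantization with one scale per tensor. Production schemes are more elaborate (per-group scales, calibration data), and the function names are illustrative.

```python
def quantize_int4(weights):
    """Map floats to integers in [-7, 7] using a single absmax scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-7, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_int4(weights)
print(q)                     # [7, -3, 1, 0]
print(dequantize(q, scale))  # close to the originals; 0.12 comes back as about 0.1
```

The gap between 0.12 and its reconstructed value is the rounding error that accumulates into the small quality loss.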

Reasoning models

LLMs that 'think' before answering — spending extra tokens on hidden chain-of-thought to solve harder problems at the cost of latency and price.

Open weights

Models whose trained parameters are publicly downloadable — self-hostable, fine-tunable, and servable by any inference provider, unlike closed API-only models.

Parameters (model size)

The trained weights inside a model, counted in billions (B). More parameters generally mean more capability, along with more cost and latency to run.
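
Parameter count translates directly into the memory needed just to hold the weights. The sketch below multiplies parameters by bytes per parameter; it deliberately ignores activations, the KV cache and any per-format overhead, so real requirements are higher.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


# A hypothetical 7B-parameter model at two precisions.
print(weight_memory_gb(7, "fp16"))  # 14.0
print(weight_memory_gb(7, "int4"))  # 3.5
```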

Mixture of experts (MoE)

An architecture that splits a model into many specialized "expert" sub-networks and activates only a few per token — large total size, small active cost.
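
The "activates only a few per token" part is usually done by a router that scores every expert and keeps the top k. A toy sketch, assuming scalar inputs and experts written as plain functions (real experts are feed-forward networks, and routing details vary by model):

```python
import math


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in ranked)
    exps = [math.exp(router_logits[i] - peak) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
# Experts 1 and 3 tie for the top score, so each gets weight 0.5.
print(moe_forward(1.0, experts, [0.1, 2.0, -1.0, 2.0]))  # 3.0
```

Four experts exist, but only two run for this input: that is the gap between total and active parameters.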

Function calling (tool use)

A model capability for emitting structured calls to external tools or APIs — the foundation of agents, retrieval and anything that acts on the world.
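
On the application side, a tool call is just structured data that your code parses and dispatches. The sketch below assumes the model emits JSON with `name` and `arguments` fields; that shape, the `get_weather` tool and the `dispatch` helper are all hypothetical, and each provider defines its own format.

```python
import json


def get_weather(city: str) -> str:
    """Stub tool; a real one would call a weather API."""
    return f"Sunny in {city}"


# Registry of tools the model is allowed to call.
TOOLS = {"get_weather": get_weather}


def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted call and run the matching local function."""
    call = json.loads(tool_call_json)
    tool = TOOLS[call["name"]]  # KeyError if the model names an unknown tool
    return tool(**call["arguments"])


print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))  # Sunny in Oslo
```

The result is normally sent back to the model as a new message so it can compose the final answer.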

Retrieval-augmented generation (RAG)

A pattern that fetches relevant documents at query time and feeds them into the prompt, so the model answers from your data instead of memory.
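
The fetch-then-prompt flow can be sketched in a few lines. This toy version ranks documents by shared words; real systems typically use embedding similarity and a vector index, and the function names and prompt wording here are illustrative.

```python
def retrieve(query, documents, top_n=2):
    """Rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(query_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_n]


def build_prompt(query, documents, top_n=2):
    """Place the retrieved documents in the prompt ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_n))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "Refunds are issued within 14 days.",
    "Our office is in Berlin.",
    "Shipping takes 3 days.",
]
print(build_prompt("how long do refunds take", docs, top_n=1))
```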

Knowledge cutoff

The date after which a model has no built-in knowledge — anything more recent must be supplied via tools, search or your prompt.

Model tiers (small, mid, frontier)

Most labs ship a family at three rough tiers — a small fast one, a balanced one, and a flagship — so you can match capability to the job and the budget.

SWE-bench

A coding benchmark of real GitHub issues: the model must produce a patch that makes the repo's actual test suite pass. It is one of the closest things to a real-world software engineering eval.

Multimodal model

A model that handles more than text — most commonly accepting image input alongside text, and sometimes audio in or image/audio out.

Fine-tuning

Continuing training on your own examples to specialize a base model's style, format or domain — distinct from just writing a better prompt.

Temperature

A sampling setting (typically 0–2) that controls randomness: low values make output near-deterministic and focused, high values make it more varied and creative.
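
Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch with made-up logits for three candidate tokens:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is usually handled as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
print(softmax_with_temperature(logits, 0.2))  # mass concentrates on the top token
print(softmax_with_temperature(logits, 2.0))  # mass spreads across all three
```

Sampling from the first distribution almost always picks the top token; sampling from the second picks the others far more often.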