Multimodal model

A model that handles more than text — most commonly accepting image input alongside text, and sometimes audio in or image/audio out.

A multimodal model accepts or produces more than one type of data. In practice this almost always means vision: the model takes images, screenshots, charts, PDFs and diagrams as input in the same prompt as text. Some also take audio input, and a few generate images or speech as output.

Vision unlocks whole categories of work that text-only models can't touch: document and receipt extraction, UI understanding, chart reading, visual question answering, and accessibility. It is a hard requirement for those workloads and irrelevant for pure text pipelines, so filter for it deliberately rather than by default.

"Multimodal" doesn't imply every modality. Check the specific input and output modalities on a model's page: many accept images but only emit text, and audio or image generation is still comparatively rare.
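
As a rough illustration of checking input and output modalities separately, here is a minimal Python sketch. The ModelCard type, the modality strings, and the three catalog entries are all hypothetical placeholders, not real models or a real API; an actual catalog would populate these fields from each model's page.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCard:
    """Hypothetical record of what one model accepts and emits."""
    name: str
    inputs: frozenset[str]
    outputs: frozenset[str]


# Made-up entries for illustration only; these are not real models.
CATALOG = [
    ModelCard("text-only-demo", frozenset({"text"}), frozenset({"text"})),
    ModelCard("vision-demo", frozenset({"text", "image"}), frozenset({"text"})),
    ModelCard(
        "omni-demo",
        frozenset({"text", "image", "audio"}),
        frozenset({"text", "audio"}),
    ),
]


def filter_models(catalog, needs_in, needs_out=("text",)):
    """Return names of models covering every required input and output modality."""
    required_in = set(needs_in)
    required_out = set(needs_out)
    return [
        card.name
        for card in catalog
        if required_in <= card.inputs and required_out <= card.outputs
    ]


if __name__ == "__main__":
    # Image input with text output: the common "vision" case.
    print(filter_models(CATALOG, {"image"}))
    # Image input AND image output: nothing in this toy catalog qualifies.
    print(filter_models(CATALOG, {"image"}, {"image"}))
```

Keeping inputs and outputs as separate sets is the point of the sketch: a model that accepts images does not thereby generate them, so a single "multimodal" flag would hide the distinction.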
