A model that handles more than text — most commonly accepting image input alongside text, and sometimes audio in or image/audio out.
A multimodal model accepts or produces more than one type of data. In practice this almost always means vision: the model reads images, screenshots, charts, PDFs and diagrams as naturally as text. Some also take audio input, and a few generate images or speech as output.
Vision unlocks whole categories of work text-only models can't touch — document and receipt extraction, UI understanding, chart reading, visual QA, accessibility. It's a hard requirement for those, and irrelevant for pure text pipelines, so filter for it deliberately.
"Multimodal" doesn't imply every modality. Check the specific input and output modalities on a model's page: many accept images but only emit text, and audio or image generation is still comparatively rare.