Best StepFun Vision Models

Updated July 2026

Step 3.7 Flash is the best vision-capable StepFun model, pairing 30.3 intelligence with image and document understanding.

Intelligence: 30.3 · Input: $0.200/M · Context: 262K

Multimodal large language models that accept image input, ranked by intelligence. The best vision language models (VLMs) for understanding images, documents and charts.

1. step-3.7-flash
Reasoning · Tools · JSON · +1
30.3 intel · $0.200/M · 262K ctx

Frequently asked

What is the best StepFun model for vision?

Step 3.7 Flash is the best vision-capable StepFun model, pairing 30.3 intelligence with image and document understanding.

How many StepFun models are there?

modelgrep tracks 2 StepFun models with live benchmarks, speed, latency and per-provider pricing; Step 3.7 Flash leads on intelligence. One of them qualifies for this ranking.
