SWE-bench

A coding benchmark built from real GitHub issues: the model must produce a patch that makes the repo's actual test suite pass. Among widely reported benchmarks, it is one of the closest to real-world software work.

SWE-bench draws from genuine resolved issues in popular open-source Python repos. The model gets the codebase and the issue text and must generate a patch; it's scored by whether the repository's own tests, which the model never sees, pass once the patch is applied. That covers both the tests the original fix made pass and the existing tests that must keep passing. This makes it far harder to game than a multiple-choice quiz, and a strong proxy for agentic coding ability.
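The scoring rule can be sketched in a few lines. This is a minimal illustration, not the official harness: the field names `FAIL_TO_PASS` and `PASS_TO_PASS` follow the published dataset, while the helper functions, the task ID, the test names, and the results dictionary are hypothetical and assume the tests have already been run against the patched repo.

```python
from typing import Dict, List


def is_resolved(fail_to_pass: List[str],
                pass_to_pass: List[str],
                results: Dict[str, bool]) -> bool:
    """Return True only if every required test passed after the patch.

    fail_to_pass: tests that failed before the fix and must now pass.
    pass_to_pass: tests that already passed and must not regress.
    results: test name -> passed?; a missing test counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(test, False) for test in required)


def resolve_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical task record and post-patch test results.
task = {
    "instance_id": "example__repo-123",  # made-up ID for illustration
    "FAIL_TO_PASS": ["test_bug_is_fixed"],
    "PASS_TO_PASS": ["test_existing_behaviour"],
}
results_after_patch = {
    "test_bug_is_fixed": True,
    "test_existing_behaviour": True,
}

resolved = is_resolved(task["FAIL_TO_PASS"], task["PASS_TO_PASS"],
                       results_after_patch)
print(resolved)                       # True
print(resolve_rate([resolved, False]))  # 0.5
```

Scoring is all-or-nothing per task: a patch that fixes the bug but breaks one previously passing test counts as unresolved, which is why the reported number is a resolve rate rather than a per-test pass rate.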

SWE-bench Verified is the human-validated subset of 500 tasks and the variant most often reported, since the original set contained some unsolvable or ambiguous tasks. It's a core ingredient in the Artificial Analysis Coding Index.

Treat it as a measure of autonomous bug-fixing in real repos specifically. A high SWE-bench score signals a model that can navigate a codebase and use tools, which correlates with — but isn't identical to — good interactive code completion.