modelgrep

GPQA (Diamond)

A graduate-level science benchmark whose questions are hard enough that even skilled non-experts with web access score barely above chance. It is widely used as a test of deep reasoning.

GPQA ("Graduate-Level Google-Proof Q&A") is a four-option multiple-choice benchmark written by PhD-level domain experts in biology, physics, and chemistry. The questions are deliberately "Google-proof": skilled non-experts with unrestricted web access average only about 34%, barely above the 25% expected from random guessing.
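To make the comparison with chance concrete, here is a minimal Python sketch of how a four-option multiple-choice score and its random baseline might be computed. The function names and the toy answer key are hypothetical and are not taken from any official GPQA evaluation harness.

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_baseline(num_options=4):
    """Expected accuracy of uniform random guessing among num_options choices."""
    return 1 / num_options


# Hypothetical toy data (not real GPQA questions or model outputs).
answers = ["A", "C", "B", "D", "A", "B", "C", "D"]
predictions = ["A", "C", "D", "D", "B", "B", "C", "A"]

acc = accuracy(predictions, answers)
margin = acc - chance_baseline()
print(f"accuracy={acc:.3f}, margin over chance={margin:.3f}")
```

On this scale, the ~34% non-expert average sits about 9 percentage points above the 25% baseline.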

GPQA Diamond is the hardest, highest-quality subset and the variant usually reported. Frontier models now score from roughly 70% to above 90%, making it one of the clearest separators between frontier and mid-tier models.
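When comparing reported scores, it can help to estimate how much of a gap could be sampling noise. The sketch below uses a simple binomial approximation. The question count of 198 is an assumption (a commonly cited size for the Diamond subset), and the helper names are hypothetical.

```python
import math


def accuracy_standard_error(acc, n_questions):
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(acc * (1 - acc) / n_questions)


def approx_95_interval(acc, n_questions):
    """Rough 95% interval using the normal approximation (1.96 standard errors)."""
    half_width = 1.96 * accuracy_standard_error(acc, n_questions)
    return max(0.0, acc - half_width), min(1.0, acc + half_width)


# Assumption: 198 questions in the Diamond subset; 0.80 is an illustrative score.
N_DIAMOND = 198
se = accuracy_standard_error(0.80, N_DIAMOND)
low, high = approx_95_interval(0.80, N_DIAMOND)
print(f"standard error={se:.3f}, approx. 95% interval=({low:.3f}, {high:.3f})")
```

Under these assumptions the standard error is a little under 3 percentage points, so gaps of only a point or two between models are hard to distinguish from noise, while the gap between a 70% and a 90% model is not.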

Treat GPQA as a measure of deep scientific reasoning specifically: a model can score modestly on GPQA yet still be excellent at coding, writing, or agentic work.