| Capability | Benchmark | GPT-5.2-Thinking | Claude-Opus-4.5 | Gemini 3 Pro | DeepSeek V3.2 | Qwen3-Max-Thinking |
|---|---|---|---|---|---|---|
| Knowledge | MMLUPro | 87.4 | 89.5 | *89.8* | 85.0 | 85.7 |
| Knowledge | MMLURedux | 95.0 | 95.6 | *95.9* | 94.5 | 92.8 |
| Knowledge | CEval | 90.5 | 92.2 | 93.4 | 92.9 | *93.7* |
| STEM | GPQA | *92.4* | 87.0 | 91.9 | 82.4 | 87.4 |
| STEM | HLE | 35.5 | 30.8 | *37.5* | 25.1 | 30.2 |
| Reasoning | LiveCodeBench v6 | 87.7 | 84.8 | *90.7* | 80.8 | 85.9 |
| Reasoning | HMMT Feb 25 | *99.4* | - | 97.5 | 92.5 | 98.0 |
| Reasoning | HMMT Nov 25 | - | - | 93.3 | 90.2 | *94.7* |
| Reasoning | IMOAnswerBench | *86.3* | 84.0 | 83.3 | 78.3 | 83.9 |
| Agentic Coding | SWE Verified | 80.0 | *80.9* | 76.2 | 73.1 | 75.3 |
| Agentic Search | HLE (w/ tools) | 45.5 | 43.2 | 45.8 | 40.8 | *49.8* |
| Instruction Following & Alignment | IFBench | *75.4* | 58.0 | 70.4 | 60.7 | 70.9 |
| Instruction Following & Alignment | MultiChallenge | 57.9 | 54.2 | *64.2* | 47.3 | 63.3 |
| Instruction Following & Alignment | ArenaHard v2 | 80.6 | 76.7 | 81.7 | 66.5 | *90.2* |
| Tool Use | Tau² Bench | 80.9 | *85.7* | 85.4 | 80.3 | 82.1 |
| Tool Use | BFCLV4 | 63.1 | *77.5* | 72.5 | 61.2 | 67.7 |
| Tool Use | Vita Bench | 38.2 | *56.3* | 51.6 | 44.1 | 40.9 |
| Tool Use | Deep Planning | *44.6* | 33.9 | 23.3 | 21.6 | 28.7 |
| Long Context | AALCR | 72.7 | *74.0* | 70.7 | 65.0 | 68.7 |