Why math is hard for language models
Standard language models are trained to predict likely text. Mathematics requires the opposite: not likely output, but correct output, where every step follows necessarily from the last and the answer is either right or wrong. A model that halluccinates plausibly in prose fails silently; a model that halluccinates in a proof fails loudly. This is why mathematical reasoning long lagged behind other language tasks — and why the recent progress, built on extended reasoning and reinforcement learning from verifiable outcomes, represents a qualitative shift rather than incremental improvement.
What the AIME benchmark measures
The American Invitational Mathematics Examination is the qualifier for the USA Mathematical Olympiad. Each of its 15 problems requires multi-step symbolic reasoning with no multiple-choice scaffolding. Historically, top human scorers answer 10–12 correctly. A score of 90%+ means the model is solving around 13–14 problems — at the level of students who go on to compete internationally. AIME has become the standard benchmark for math AI because it's hard enough to differentiate models, objective enough to score automatically, and familiar enough that researchers understand what each point of improvement actually means.
Which models lead in 2026?
As of mid-2026, the top of the AIME leaderboard is contested. GPT-5.2 on extended reasoning (xhigh setting) achieves 99% on AIME 2025 — near-perfect performance, though at significant latency cost. THUDM's GLM-4.7 (Thinking) leads the open benchmark table with 95% on AIME 2025 and 86% on GPQA Diamond, a graduate-level science reasoning benchmark. DeepSeek's R1 family remains widely deployed in production: DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024 and 94.3% on the MATH-500 benchmark at far lower inference cost than closed models. Alibaba's QwQ-32B is the strong open-weight generalist, widely used for its balance of capability and deployability.
Claude Mythos Preview leads some math leaderboards with a score of 61.4 on the combined math index compiled by llm-stats.com, followed by Muse Spark (54.0) and Gemini 3.1 Pro (53.0). All three use extended reasoning modes that add 2–5× latency compared to standard generation. The pattern is consistent: the best math scores come from models that pause to think, not models that answer immediately.
How extended reasoning works
The key technique is chain-of-thought reasoning under reinforcement learning with verifiable rewards. During training, the model is rewarded not just for producing a correct final answer but for producing reasoning chains that lead to correct answers. This creates a trained behavior where the model "thinks out loud" — generating intermediate steps, checking them, backtracking when something doesn't hold, and arriving at a verified answer rather than a predicted one. A single AIME problem can take 30–60 seconds of inference under extended reasoning; for casual math queries, standard generation is faster and often sufficient.
For symbolic computation — simplifying expressions, solving integrals, checking algebraic identities — Wolfram Alpha remains more reliable than any LLM. Language models are better at multi-step reasoning about novel problems than at rote symbolic manipulation. The practical workflow: use Wolfram for exact symbolic work, use a reasoning LLM for problem decomposition and proof strategy.
Formal theorem proving
A separate but related frontier is AI in formal proof systems like Lean and Coq, where proofs are machine-checked and every step must satisfy type-theoretic constraints. Google DeepMind's AlphaProof demonstrated in 2024 that a model could solve International Mathematical Olympiad problems in Lean with verified proofs. In 2026, the models pushing this frontier combine language model reasoning with formal verification environments — generating proof attempts, checking them against the verifier, and iterating until a proof either succeeds or fails definitively. This is qualitatively different from AIME performance: AIME checks final answers; formal proof systems check every step.
Open-source options worth knowing
For teams that need math reasoning without closed-model pricing or data-handling constraints, the open-weight options are strong. DeepSeek-R1 and its Qwen-distilled variants are the most widely deployed, with R1-Distill-Qwen-32B offering the best balance of performance and resource requirements for self-hosted inference. Alibaba's QwQ-32B-Preview is the strong generalist. THUDM's GLM series (from Tsinghua University's KEG lab) has consistently topped open math benchmarks in 2026. All three support extended reasoning modes.
Where math LLMs still fall short
Reliability at the hardest tier remains inconsistent. Models that score 99% on AIME 2025 still hallucinate on problems outside their training distribution, make arithmetic errors that a human would catch instantly, and struggle with certain geometry and combinatorics problem types. For production use in education or research tools, the standard practice is to run multiple completions and take the majority answer — a technique called self-consistency sampling — rather than trusting a single generation. Symbolic computation remains the uncrossed line: no LLM currently beats Wolfram Alpha or Mathematica on systematic algebraic manipulation.