Weighted ranking report

Best AI for Math

Competition math, formal proofs, research-level problem solving, and verifiable reasoning. This report blends public leaderboard signals into one task-specific composite score, then highlights the practical cautions that matter before you choose a model.

Current winner: Claude Fable 5

Adjusted score: 95.9

Snapshot: 2026-06-12

Top contenders

The leading five models in this composite

21 total models

1. Claude Fable 5 · Model 98 · Confidence 90% · Adjusted score 95.9

2. Gemini 3.5 Flash · Model 100 · Confidence 65% · Adjusted score 91.2

3. Claude Opus 4.6 Thinking · Model 100 · Confidence 65% · Adjusted score 91.2

4. GPT-5.4 High · Model 99 · Confidence 65% · Adjusted score 90.5

5. Claude Opus 4.6 · Model 99 · Confidence 65% · Adjusted score 90.5

Full-source models: 2

Average coverage: 39%

Top-five spread: 5.4
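
The three summary figures can be reproduced from the rows on this page. The sketch below assumes that average coverage is the unweighted mean of each row's confirmed-coverage percentage and that the spread is the gap between the highest and lowest adjusted scores in the top five. The report does not state these definitions; they are inferred here because they reproduce the published numbers.

```python
# Confirmed coverage (%) for the 21 ranked rows, in rank order, copied from this page:
# rank 1 is 75, ranks 2-17 are 25, ranks 18-21 are 100, 75, 75, 100.
coverage_by_rank = [75] + [25] * 16 + [100, 75, 75, 100]

# Adjusted scores of the top five rows, copied from this page.
top_five_adjusted = [95.9, 91.2, 91.2, 90.5, 90.5]

# Assumed definitions (not stated by the report).
full_source_models = sum(1 for c in coverage_by_rank if c == 100)
average_coverage = sum(coverage_by_rank) / len(coverage_by_rank)
top_five_spread = max(top_five_adjusted) - min(top_five_adjusted)

print(full_source_models)          # 2
print(round(average_coverage))     # 39
print(round(top_five_spread, 1))   # 5.4
```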

All ranked models

Complete composite model ranking

Showing all 21 models with at least one confirmed source row in this category. Models with no category source coverage are excluded. Confirmed rows are ordered by Bayesian-smoothed adjusted score; missing source rows stay n/a instead of counting as zero.
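
The report does not publish its scoring formula, so the following is only a minimal sketch of one common way to build such a composite: a weighted mean over confirmed sources (skipping n/a rows rather than scoring them as zero), followed by shrinkage toward a prior in proportion to confidence. The names `WEIGHTS`, `PRIOR_SCORE`, `model_score`, `confirmed_coverage`, and `adjusted_score` are hypothetical. Only the 25% share for the Math source is suggested by the page (a Math-only row shows 25% confirmed coverage); the other weights and the prior are illustrative assumptions.

```python
from typing import Mapping, Optional

# Hypothetical source weights. Only the 0.25 share for "Math" is suggested by
# the page; the even split across the other five sources is an assumption.
WEIGHTS = {
    "Math": 0.25,
    "Vals ProofBench": 0.15,
    "Surge AI Riemann-bench": 0.15,
    "Vellum AIME 2025": 0.15,
    "Artificial Analysis AIME 2025": 0.15,
    "AA Index": 0.15,
}

PRIOR_SCORE = 75.0  # hypothetical prior mean; the report does not publish one

Scores = Mapping[str, Optional[float]]


def model_score(scores: Scores) -> float:
    """Weighted mean over confirmed sources; None (n/a) rows are skipped, not zeroed."""
    confirmed = {name: s for name, s in scores.items() if s is not None}
    if not confirmed:
        raise ValueError("no confirmed source rows")
    total_weight = sum(WEIGHTS[name] for name in confirmed)
    return sum(WEIGHTS[name] * s for name, s in confirmed.items()) / total_weight


def confirmed_coverage(scores: Scores) -> float:
    """Share of total source weight that has a confirmed score."""
    return sum(WEIGHTS[name] for name, s in scores.items() if s is not None)


def adjusted_score(model: float, confidence: float, prior: float = PRIOR_SCORE) -> float:
    """Shrink the model score toward the prior; confidence 1.0 leaves it unchanged."""
    return confidence * model + (1.0 - confidence) * prior


# A Math-only row: the five missing sources stay n/a and do not drag the mean down.
math_only = {name: None for name in WEIGHTS}
math_only["Math"] = 100.0
print(model_score(math_only), confirmed_coverage(math_only))
```

Under this kind of scheme a row with a single perfect source keeps a high model score but is pulled toward the prior by its lower confidence, which is why a well-covered model can outrank a sparsely covered one with a higher raw score.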

1. Claude Fable 5
Anthropic · Proprietary API
Best overall pick when formal proof and research math both matter.
Caution: Still verify every proof or derivation independently.
Source coverage 5/6 · 75% confirmed coverage · 90% confidence
Math n/a · Vals ProofBench 100 · Surge AI Riemann-bench 100 · Vellum AIME 2025 94 · Artificial Analysis AIME 2025 94 · AA Index 100
Adjusted score 95.9 · Rank #1 · Model 98 · Confidence 90%

2. Gemini 3.5 Flash
Google · Proprietary API and apps
Fast math assistance and high-volume quantitative problem solving.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 100 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 91.2 · Rank #2 · Model 100 · Confidence 65%

3. Claude Opus 4.6 Thinking
Anthropic · Proprietary API
Extended-reasoning math explanations and proof-style drafts.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 100 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 91.2 · Rank #3 · Model 100 · Confidence 65%

4. GPT-5.4 High
OpenAI · Proprietary API
High-effort OpenAI math reasoning and contest-style problem solving.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 99 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 90.5 · Rank #4 · Model 99 · Confidence 65%

5. Claude Opus 4.6
Anthropic · Proprietary API
Strong math explanations and symbolic reasoning drafts.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 99 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 90.5 · Rank #5 · Model 99 · Confidence 65%

6. Claude Opus 4.8 Thinking
Anthropic · Proprietary API
Careful math reasoning where slower thinking-mode review is useful.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 99 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 90.5 · Rank #6 · Model 99 · Confidence 65%

7. Gemini 3.1 Pro Preview
Google · Proprietary API and apps
Long-context math prompts, multimodal math, and Google ecosystem workflows.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 99 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 90.5 · Rank #7 · Model 99 · Confidence 65%

8. GPT-5.5 High
OpenAI · Proprietary API
High-effort OpenAI math runs when exact reasoning settings can be controlled.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 98 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 89.9 · Rank #8 · Model 98 · Confidence 65%

9. Claude Opus 4.7
Anthropic · Proprietary API
Premium math support and detailed solution review.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 98 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 89.9 · Rank #9 · Model 98 · Confidence 65%

10. Claude Opus 4.7 Thinking
Anthropic · Proprietary API
Thinking-mode solution planning and multi-step derivations.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 98 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 89.9 · Rank #10 · Model 98 · Confidence 65%

11. Qwen3.7 Max Preview
Alibaba · Proprietary API
Qwen math reasoning tests and cost-aware quantitative workflows.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 98 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 89.9 · Rank #11 · Model 98 · Confidence 65%

12. Kimi K2.6
Moonshot · Modified MIT
Long-context math reading and open-weight oriented comparison.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 98 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 89.9 · Rank #12 · Model 98 · Confidence 65%

13. Mimo V2.5 Pro
Xiaomi · MIT
Open-weight math shortlist testing and lower-cost deployment experiments.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 98 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 89.9 · Rank #13 · Model 98 · Confidence 65%

14. Qwen3.6 Max Preview
Alibaba · Proprietary API
Preview-model Qwen math comparisons.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 97 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 89.2 · Rank #14 · Model 97 · Confidence 65%

15. MiniMax M3
MiniMax · Proprietary API
Alternative math reasoning tests across structured prompts.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 97 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 89.2 · Rank #15 · Model 97 · Confidence 65%

16. Gemini 3 Flash
Google · Proprietary API and apps
Fast math drafting and lower-latency reasoning workflows.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 97 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 89.2 · Rank #16 · Model 97 · Confidence 65%

17. Kimi K2.5 Thinking
Moonshot · Modified MIT
Reasoning-forward Kimi math prompts and long-context derivations.
Caution: Arena Math preference data is confirmed for this row; verify final answers with symbolic checks or your own benchmark set.
Source coverage 1/6 · 25% confirmed coverage · 65% confidence
Math 97 · Vals ProofBench n/a · Surge AI Riemann-bench n/a · Vellum AIME 2025 n/a · Artificial Analysis AIME 2025 n/a · AA Index n/a
Adjusted score 89.2 · Rank #17 · Model 97 · Confidence 65%

18. Claude Opus 4.8
Anthropic · Proprietary API
Formal math and careful proof-style reasoning.
Caution: Less dominant than Fable 5 on Riemann-bench.
Source coverage 6/6 · 100% confirmed coverage · 100% confidence
Math 98 · Vals ProofBench 90 · Surge AI Riemann-bench 47 · Vellum AIME 2025 94 · Artificial Analysis AIME 2025 90 · AA Index 94
Adjusted score 84.2 · Rank #18 · Model 84.2 · Confidence 100%

19. GPT-5.5
OpenAI · Proprietary API
Hard math plus general advanced reasoning.
Caution: Use exact model effort settings; benchmark names often include reasoning level.
Source coverage 5/6 · 75% confirmed coverage · 90% confidence
Math n/a · Vals ProofBench 82 · Surge AI Riemann-bench 76 · Vellum AIME 2025 88 · Artificial Analysis AIME 2025 88 · AA Index 92
Adjusted score 82.5 · Rank #19 · Model 83.1 · Confidence 90%

20. GPT-5.2
OpenAI · Proprietary API
AIME-style math and broad reasoning workflows.
Caution: Formal proof benchmarks are a different skill from contest answer accuracy.
Source coverage 5/6 · 75% confirmed coverage · 90% confidence
Math n/a · Vals ProofBench 78 · Surge AI Riemann-bench 58 · Vellum AIME 2025 100 · Artificial Analysis AIME 2025 100 · AA Index 90
Adjusted score 80.4 · Rank #20 · Model 80.8 · Confidence 90%

21. Gemini 3 Pro
Google · Proprietary API and apps
Competition math, long context, and multimodal math workflows.
Caution: Riemann-bench and formal proof results trail the top Anthropic model in this snapshot.
Source coverage 6/6 · 100% confirmed coverage · 100% confidence
Math 97 · Vals ProofBench 78 · Surge AI Riemann-bench 28 · Vellum AIME 2025 100 · Artificial Analysis AIME 2025 95 · AA Index 88
Adjusted score 78.3 · Rank #21 · Model 78.3 · Confidence 100%

How to use it

Treat the winner as a shortlist, not a final procurement decision

The top model is the best blended pick for this query snapshot, but model choice should still account for price, latency, privacy, context length, tool access, safety settings, and your own benchmark prompts. Use this page to reduce the search space, then run a small evaluation on your real tasks before standardizing.
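
A small evaluation of the kind described above can be sketched as follows. This is an illustrative harness, not a method prescribed by the report: `answers_match`, `accuracy`, and `stub_model` are hypothetical names, the prompts are made up, and `ask_model` stands in for whatever client call you use to query a shortlisted model. Answers are compared exactly with the standard-library `fractions` module so that equivalent forms such as `0.5` and `1/2` agree.

```python
from fractions import Fraction
from typing import Callable, Iterable, Tuple


def answers_match(predicted: str, expected: str) -> bool:
    """Compare numeric answers exactly; fall back to string equality if not numeric."""
    try:
        return Fraction(predicted.strip()) == Fraction(expected.strip())
    except (ValueError, ZeroDivisionError):
        return predicted.strip() == expected.strip()


def accuracy(ask_model: Callable[[str], str],
             cases: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prompt, expected_answer) cases the model answers correctly."""
    case_list = list(cases)
    if not case_list:
        raise ValueError("need at least one evaluation case")
    correct = sum(answers_match(ask_model(prompt), expected)
                  for prompt, expected in case_list)
    return correct / len(case_list)


# Stand-in for a real model call (hypothetical): one right answer, one wrong.
def stub_model(prompt: str) -> str:
    canned = {"What is 1/4 + 1/4?": "0.5", "What is 7 * 6?": "41"}
    return canned[prompt]


cases = [("What is 1/4 + 1/4?", "1/2"), ("What is 7 * 6?", "42")]
print(accuracy(stub_model, cases))  # 0.5
```

Running the same case list against each shortlisted model gives a like-for-like accuracy figure on your own tasks, which you can weigh alongside price, latency, and the other factors listed above.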
