Weighted ranking report

Best AI for Coding

This report ranks AI models for agentic coding, web development, repository fixes, and production engineering tasks. It blends public leaderboard signals into one task-specific composite score, then shows the best-fit use case, evidence coverage, and decision context for each ranked model.

Use this coding ranking to shortlist AI models for repository edits, agentic coding, bug fixes, code review, and web application generation.

Last updated: June 16, 2026

Methodology

What changed in this update

  • Added a visible update note so readers can see when the coding shortlist was last reviewed.
  • Rechecked coding-source weights across repository repair, SWE-bench, vibe coding, and general intelligence signals.
  • Expanded the decision guide for teams choosing between coding agents, IDE assistants, and API-based code workflows.

Page value

Repository fixes and agentic coding shortlist.

Data basis

5 public sources · 86 models

Ranking snapshot

2026-06-16

Current winner

Claude Fable 5

Adjusted score

99.9

Best for

  • Agentic coding workflows that edit files, run tests, and explain diffs.
  • Web development, refactoring, debugging, and framework-specific implementation tasks.

Evaluate

  • Run each finalist on a private repository test set with real issues and failing tests.
  • Compare diff quality, tool reliability, latency, price, context length, and rollback behavior.

Avoid

  • Letting any model make autonomous production changes without human review.
  • Choosing only by general chat quality when the task is code execution and repository repair.

All ranked models

Complete composite model ranking

Showing 86 models with at least one source score. Rows are ordered by Bayesian-smoothed adjusted score; missing source rows stay n/a instead of counting as zero.
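One way to read "missing source rows stay n/a instead of counting as zero" is a weighted mean taken only over the sources that reported a score. The Python sketch below illustrates that idea; the equal weights and the `composite` helper are assumptions for illustration, since the report's actual source weights are not stated on this page.

```python
from typing import Optional

# Hypothetical equal weights; the report's real source weights are not given here.
WEIGHTS = {
    "Code Arena": 1.0,
    "Vals SWE-bench Verified": 1.0,
    "Vals Vibe Code Bench": 1.0,
    "Vellum": 1.0,
    "AA Index": 1.0,
}


def composite(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over the sources that have a score; None means n/a."""
    present = {src: val for src, val in scores.items() if val is not None}
    if not present:
        return None  # no evidence at all: stay n/a rather than report zero
    total_weight = sum(WEIGHTS[src] for src in present)
    return sum(WEIGHTS[src] * val for src, val in present.items()) / total_weight


single_source = {
    "Code Arena": 94,
    "Vals SWE-bench Verified": None,
    "Vals Vibe Code Bench": None,
    "Vellum": None,
    "AA Index": None,
}
full_coverage = {
    "Code Arena": 93,
    "Vals SWE-bench Verified": 93,
    "Vals Vibe Code Bench": 91,
    "Vellum": 93,
    "AA Index": 94,
}

print(composite(single_source))
print(composite(full_coverage))
```

If the four missing sources were filled with zeros instead, the single Code Arena score of 94 would average out to 18.8, which is why the n/a handling matters.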

1. Claude Fable 5
Anthropic · Proprietary API
Best fit: Highest-confidence coding and autonomous engineering tasks.
Source coverage: 5/5 sources (full evidence)
Source scores: Code Arena 100 · Vals SWE-bench Verified 100 · Vals Vibe Code Bench 100 · Vellum 99 · AA Index 100
Adjusted score: 99.9 · Rank: #1 · Model: 99.9 · Confidence: 100%

2. Claude Opus 4.8
Anthropic · Proprietary API
Best fit: Large code changes where review quality matters.
Source coverage: 5/5 sources (full evidence)
Source scores: Code Arena 93 · Vals SWE-bench Verified 93 · Vals Vibe Code Bench 91 · Vellum 93 · AA Index 94
Adjusted score: 92.7 · Rank: #2 · Model: 92.7 · Confidence: 100%

3. Claude Opus 4.7 Thinking
Anthropic · Proprietary API
Best fit: High-end coding when extended reasoning mode is useful.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 94 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 89.9 · Rank: #3 · Model: 94 · Confidence: 69%

4. Claude Opus 4.8 Thinking
Anthropic · Proprietary API
Best fit: Large code changes that benefit from slower thinking-mode review.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 94 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 89.9 · Rank: #4 · Model: 94 · Confidence: 69%

5. Claude Opus 4.7
Anthropic · Proprietary API
Best fit: Stable high-end coding when Fable or Opus 4.8 is unavailable.
Source coverage: 5/5 sources (full evidence)
Source scores: Code Arena 94 · Vals SWE-bench Verified 86 · Vals Vibe Code Bench 87 · Vellum 92 · AA Index 89
Adjusted score: 89.9 · Rank: #5 · Model: 89.9 · Confidence: 100%

6. Claude Opus 4.6 Thinking
Anthropic · Proprietary API
Best fit: Extended reasoning on large repositories and multi-file refactors.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 93 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 89.2 · Rank: #6 · Model: 93 · Confidence: 69%

7. Claude Opus 4.6
Anthropic · Proprietary API
Best fit: Premium coding assistance with strong instruction following.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 92 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 88.5 · Rank: #7 · Model: 92 · Confidence: 69%

8. Qwen3.7 Max
Alibaba · Proprietary API
Best fit: Cost-aware coding workflows where Alibaba model access is preferred.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 92 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 88.5 · Rank: #8 · Model: 92 · Confidence: 69%

9. GLM-5.1
Z.ai · MIT
Best fit: Open-weight friendly coding experiments and self-hosted evaluations.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 92 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 88.5 · Rank: #9 · Model: 92 · Confidence: 69%

10. Claude Sonnet 4.6
Anthropic · Proprietary API
Best fit: Daily coding edits, review, and lower-cost agentic workflows.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 91 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 87.9 · Rank: #10 · Model: 91 · Confidence: 69%

11. MiniMax M3
MiniMax · Proprietary API
Best fit: Alternative coding assistant testing across web development prompts.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 91 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 87.9 · Rank: #11 · Model: 91 · Confidence: 69%

12. Kimi K2.6
Moonshot · Modified MIT
Best fit: Long-context code reading and open-weight oriented comparison.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 91 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 87.9 · Rank: #12 · Model: 91 · Confidence: 69%

13. Muse Spark
Meta · Proprietary API
Best fit: Experimental web development generation and UI coding prompts.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 91 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 87.9 · Rank: #13 · Model: 91 · Confidence: 69%

14. GPT-5.5 xhigh
OpenAI · Proprietary API
Best fit: Highest-effort OpenAI coding harness runs and difficult repository tasks.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 90 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 87.2 · Rank: #14 · Model: 90 · Confidence: 69%

15. GPT-5.5
OpenAI · Proprietary API
Best fit: Coding plus broad product, data, and tool-use work.
Source coverage: 5/5 sources (full evidence)
Source scores: Code Arena 87 · Vals SWE-bench Verified 87 · Vals Vibe Code Bench 83 · Vellum 87 · AA Index 92
Adjusted score: 86.7 · Rank: #15 · Model: 86.7 · Confidence: 100%

16. Claude Opus 4.5 Thinking
Anthropic · Proprietary API
Best fit: Careful multi-step coding where thinking-mode behavior is preferred.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 89 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 86.5 · Rank: #16 · Model: 89 · Confidence: 69%

17. Qwen3.6 Max Preview
Alibaba · Proprietary API
Best fit: Preview-model coding tests before standardizing on Qwen releases.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 89 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 86.5 · Rank: #17 · Model: 89 · Confidence: 69%

18. GPT-5.5 High
OpenAI · Proprietary API
Best fit: Higher-effort OpenAI coding runs with strong tool-use potential.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 89 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 86.5 · Rank: #18 · Model: 89 · Confidence: 69%

19. Mimo V2.5 Pro
Xiaomi · MIT
Best fit: Open-weight coding model tests and lower-cost deployment planning.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 88 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 85.8 · Rank: #19 · Model: 88 · Confidence: 69%

20. Claude Opus 4.5
Anthropic · Proprietary API
Best fit: High-quality coding and review when newer Opus versions are unavailable.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 88 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 85.8 · Rank: #20 · Model: 88 · Confidence: 69%

21. Qwen3.6 Plus
Alibaba · Proprietary API
Best fit: Balanced Qwen coding workloads and product prototyping.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 88 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 85.8 · Rank: #21 · Model: 88 · Confidence: 69%

22. DeepSeek V4 Pro Thinking
DeepSeek · MIT
Best fit: Reasoning-heavy coding comparisons with open-weight deployment options.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 88 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 85.8 · Rank: #22 · Model: 88 · Confidence: 69%

23. GPT-5.4 High
OpenAI · Proprietary API
Best fit: High-effort coding assistance and tool-call workflows.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 88 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 85.8 · Rank: #23 · Model: 88 · Confidence: 69%

24. Gemini 3.1 Pro Preview
Google · Proprietary API
Best fit: Google ecosystem coding tests and preview-model evaluation.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 87 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 85.1 · Rank: #24 · Model: 87 · Confidence: 69%

25. GLM-4.7
Z.ai · MIT
Best fit: Open-weight coding and local evaluation candidates.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 86 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 84.4 · Rank: #25 · Model: 86 · Confidence: 69%

26. Gemini 3 Pro
Google · Proprietary API
Best fit: General coding, research, and Google-integrated development workflows.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 86 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 84.4 · Rank: #26 · Model: 86 · Confidence: 69%

27. GPT-5.4 Medium
OpenAI · Proprietary API
Best fit: Balanced OpenAI coding runs where latency and cost matter.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 86 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 84.4 · Rank: #27 · Model: 86 · Confidence: 69%

28. Gemini 3 Flash
Google · Proprietary API
Best fit: Fast web development assistance and lightweight coding workflows.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 86 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 84.4 · Rank: #28 · Model: 86 · Confidence: 69%

29. Mimo V2.5
Xiaomi · MIT
Best fit: Open-weight coding shortlist testing.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 86 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 84.4 · Rank: #29 · Model: 86 · Confidence: 69%

30. GLM-5
Z.ai · MIT
Best fit: Open-weight coding model evaluation with Z.ai releases.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 86 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 84.4 · Rank: #30 · Model: 86 · Confidence: 69%

31. Mimo V2 Pro
Xiaomi · Proprietary API
Best fit: Xiaomi coding model comparison for web development tasks.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 86 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 84.4 · Rank: #31 · Model: 86 · Confidence: 69%

32. Kimi K2.5 Thinking
Moonshot · Modified MIT
Best fit: Reasoning-forward coding and long-context repository prompts.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 86 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 84.4 · Rank: #32 · Model: 86 · Confidence: 69%

33. Gemini 3.5 Flash
Google · Proprietary API
Best fit: Fast coding assistance and lower-latency product workflows.
Source coverage: 5/5 sources (full evidence)
Source scores: Code Arena 90 · Vals SWE-bench Verified 83 · Vals Vibe Code Bench 76 · Vellum 80 · AA Index 88
Adjusted score: 84.3 · Rank: #33 · Model: 84.3 · Confidence: 100%

34. Kimi K2.5 Instant
Moonshot · Modified MIT
Best fit: Faster Kimi coding runs and draft implementation work.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 85 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 83.7 · Rank: #34 · Model: 85 · Confidence: 69%

35. GPT-5.3 Codex
OpenAI · Proprietary API
Best fit: Codex-style coding workflows and repository automation.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 85 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 83.7 · Rank: #35 · Model: 85 · Confidence: 69%

36. GPT-5.2
OpenAI · Proprietary API
Best fit: General OpenAI coding support where newer variants are unavailable.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 84 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 83.1 · Rank: #36 · Model: 84 · Confidence: 69%

37. GPT-5.4 Mini High
OpenAI · Proprietary API
Best fit: Smaller high-effort OpenAI coding tasks.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 84 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 83.1 · Rank: #37 · Model: 84 · Confidence: 69%

38. MiniMax M2.7
MiniMax · Modified MIT
Best fit: Open-ish MiniMax coding model tests and web UI generation.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 84 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 83.1 · Rank: #38 · Model: 84 · Confidence: 69%

39. Qwen3.5 397B A17B
Alibaba · Apache 2.0
Best fit: Open-source Qwen coding evaluation and self-hosted experiments.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 84 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 83.1 · Rank: #39 · Model: 84 · Confidence: 69%

40. GPT-5 Medium
OpenAI · Proprietary API
Best fit: Baseline OpenAI coding assistance.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 84 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 83.1 · Rank: #40 · Model: 84 · Confidence: 69%

41. GPT-5.4
OpenAI · Proprietary API
Best fit: General OpenAI coding and product engineering tasks.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 84 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 83.1 · Rank: #41 · Model: 84 · Confidence: 69%

42. MiniMax M2.1 Preview
MiniMax · MIT
Best fit: Preview-model web development comparison.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 84 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 83.1 · Rank: #42 · Model: 84 · Confidence: 69%

43. GPT-5.1 Medium
OpenAI · Proprietary API
Best fit: Older OpenAI coding benchmark comparison.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 84 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 83.1 · Rank: #43 · Model: 84 · Confidence: 69%

44. Grok 4.20 Beta Reasoning
xAI · Proprietary API
Best fit: Reasoning-heavy xAI coding tests and alternative model comparison.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 83 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 82.4 · Rank: #44 · Model: 83 · Confidence: 69%

45. Claude Sonnet 4.5 Thinking
Anthropic · Proprietary API
Best fit: Thinking-mode coding and careful edit planning.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 83 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 82.4 · Rank: #45 · Model: 83 · Confidence: 69%

46. Gemini 3 Flash Thinking Minimal
Google · Proprietary API
Best fit: Low-latency Gemini coding with minimal thinking behavior.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 83 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 82.4 · Rank: #46 · Model: 83 · Confidence: 69%

47. Claude Opus 4.1
Anthropic · Proprietary API
Best fit: Legacy Opus-class coding and review comparisons.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 83 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 82.4 · Rank: #47 · Model: 83 · Confidence: 69%

48. Claude Sonnet 4.5
Anthropic · Proprietary API
Best fit: Routine coding, bug fixes, and code review.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 83 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 82.4 · Rank: #48 · Model: 83 · Confidence: 69%

49. MiniMax M2.5
MiniMax · Modified MIT
Best fit: MiniMax coding model evaluation on web tasks.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 83 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 82.4 · Rank: #49 · Model: 83 · Confidence: 69%

50. Gemma 4 31B
Google · Apache 2.0
Best fit: Open-source local or hosted coding experiments.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 83 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 82.4 · Rank: #50 · Model: 83 · Confidence: 69%

51. GPT-5.3 Codex
OpenAI · Proprietary API
Best fit: Codex-style coding workflows and repository automation at the lower Arena score band.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 82 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 81.7 · Rank: #51 · Model: 82 · Confidence: 69%

52. Grok 4.3
xAI · Proprietary API
Best fit: xAI coding assistant evaluation.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 82 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 81.7 · Rank: #52 · Model: 82 · Confidence: 69%

53. DeepSeek V3.2 Thinking
DeepSeek · MIT
Best fit: Reasoning-oriented open-weight coding tests.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 82 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 81.7 · Rank: #53 · Model: 82 · Confidence: 69%

54. Qwen3.5 122B A10B
Alibaba · Apache 2.0
Best fit: Smaller Qwen self-hosted coding experiments.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 82 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 81.7 · Rank: #54 · Model: 82 · Confidence: 69%

55. Hunyuan HY3 Preview
Tencent · Tencent Hunyuan Community
Best fit: Tencent preview-model coding comparison.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 82 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 81.7 · Rank: #55 · Model: 82 · Confidence: 69%

56. Gemma 4 26B A4B
Google · Apache 2.0
Best fit: Smaller open-source coding and local deployment tests.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 82 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 81.7 · Rank: #56 · Model: 82 · Confidence: 69%

57. Qwen3.5 27B
Alibaba · Apache 2.0
Best fit: Lower-resource Qwen coding experiments.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 82 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 81.7 · Rank: #57 · Model: 82 · Confidence: 69%

58. GLM-4.6
Z.ai · MIT
Best fit: Open-weight coding model comparison.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 81 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 81.0 · Rank: #58 · Model: 81 · Confidence: 69%

59. GPT-5.1
OpenAI · Proprietary API
Best fit: Legacy OpenAI coding benchmark reference.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 80 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 80.3 · Rank: #59 · Model: 80 · Confidence: 69%

60. Mimo V2 Flash Non-Thinking
Xiaomi · MIT
Best fit: Fast open-weight Xiaomi coding tasks.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 80 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 80.3 · Rank: #60 · Model: 80 · Confidence: 69%

61. GPT-5.2 Codex
OpenAI · Proprietary API
Best fit: Codex-style repository automation on older OpenAI releases.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 80 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 80.3 · Rank: #61 · Model: 80 · Confidence: 69%

62. DeepSeek V3.2
DeepSeek · MIT
Best fit: Open-weight coding and local agent comparisons.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 80 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 80.3 · Rank: #62 · Model: 80 · Confidence: 69%

63. Kimi K2 Thinking Turbo
Moonshot · Modified MIT
Best fit: Fast reasoning-focused Kimi coding workflows.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 80 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 80.3 · Rank: #63 · Model: 80 · Confidence: 69%

64. GPT-5.1 Codex
OpenAI · Proprietary API
Best fit: Older Codex-style coding workflows.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 80 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 80.3 · Rank: #64 · Model: 80 · Confidence: 69%

65. Claude Haiku 4.5
Anthropic · Proprietary API
Best fit: Fast, lower-cost coding assistance and code review drafts.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 80 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 80.3 · Rank: #65 · Model: 80 · Confidence: 69%

66. MiniMax M2
MiniMax · Apache 2.0
Best fit: Open-source MiniMax coding baseline comparisons.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 78 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 78.9 · Rank: #66 · Model: 78 · Confidence: 69%

67. Mimo V2 Flash Thinking
Xiaomi · MIT
Best fit: Fast Xiaomi coding model tests with thinking behavior.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 78 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 78.9 · Rank: #67 · Model: 78 · Confidence: 69%

68. DeepSeek V3.2 Exp
DeepSeek · MIT
Best fit: Experimental DeepSeek coding model comparison.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 77 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 78.3 · Rank: #68 · Model: 77 · Confidence: 69%

69. Qwen3 Coder 480B A35B Instruct
Alibaba · Apache 2.0
Best fit: Coder-specialized Qwen deployment and benchmark testing.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 77 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 78.3 · Rank: #69 · Model: 77 · Confidence: 69%

70. Mistral Medium 3.5
Mistral · Modified MIT
Best fit: Mistral coding benchmark comparison and EU-provider shortlist work.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 76 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 77.6 · Rank: #70 · Model: 76 · Confidence: 69%

71. KAT-Coder-Pro-V1
KwaiKAT · Proprietary API
Best fit: Coder-specialized comparison across web development tasks.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 76 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 77.6 · Rank: #71 · Model: 76 · Confidence: 69%

72. Qwen3.5 35B A3B
Alibaba · Apache 2.0
Best fit: Smaller Qwen coding deployments and local evaluation.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 75 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 76.9 · Rank: #72 · Model: 75 · Confidence: 69%

73. Gemini 3.1 Flash Lite Preview
Google · Proprietary API
Best fit: Low-latency Gemini coding assistant comparison.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 75 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 76.9 · Rank: #73 · Model: 75 · Confidence: 69%

74. Trinity Large Thinking
Arcee AI · Apache 2.0
Best fit: Open-source thinking-model coding tests.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 75 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 76.9 · Rank: #74 · Model: 75 · Confidence: 69%

75. GPT-5.1 Codex Mini
OpenAI · Proprietary API
Best fit: Lower-cost Codex-style coding support.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 74 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 76.2 · Rank: #75 · Model: 74 · Confidence: 69%

76. Qwen3.5 Flash
Alibaba · Proprietary API
Best fit: Fast Qwen coding assistance and draft implementation.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 74 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 76.2 · Rank: #76 · Model: 74 · Confidence: 69%

77. Grok 4.1 Fast Reasoning
xAI · Proprietary API
Best fit: Fast xAI reasoning on coding tasks.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 74 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 76.2 · Rank: #77 · Model: 74 · Confidence: 69%

78. Mistral Large 3
Mistral · Apache 2.0
Best fit: Mistral coding and reasoning comparison.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 73 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 75.5 · Rank: #78 · Model: 73 · Confidence: 69%

79. Grok 4.1 Thinking
xAI · Proprietary API
Best fit: xAI thinking-mode coding tests.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 73 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 75.5 · Rank: #79 · Model: 73 · Confidence: 69%

80. Gemini 2.5 Pro
Google · Proprietary API
Best fit: Legacy Gemini coding benchmark reference.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 72 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 74.8 · Rank: #80 · Model: 72 · Confidence: 69%

81. Granite 4.1 8B
IBM · Apache 2.0
Best fit: Small open-source coding model experiments.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 72 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 74.8 · Rank: #81 · Model: 72 · Confidence: 69%

82. Devstral 2
Mistral · Modified MIT
Best fit: Developer-focused Mistral coding benchmark comparison.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 72 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 74.8 · Rank: #82 · Model: 72 · Confidence: 69%

83. Mercury 2
Inception AI · Proprietary API
Best fit: Alternative model coding evaluation.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 70 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 73.5 · Rank: #83 · Model: 70 · Confidence: 69%

84. Grok 4 Fast Reasoning
xAI · Proprietary API
Best fit: Fast Grok reasoning for coding prompts.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 69 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 72.8 · Rank: #84 · Model: 69 · Confidence: 69%

85. Grok Code Fast 1
xAI · Proprietary API
Best fit: Code-specialized Grok assistant comparison.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 68 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 72.1 · Rank: #85 · Model: 68 · Confidence: 69%

86. Devstral Medium 2507
Mistral · Proprietary API
Best fit: Lower-ranked Mistral developer-model baseline.
Source coverage: 1/5 sources (low evidence)
Source scores: Code Arena 66 · Vals SWE-bench Verified n/a · Vals Vibe Code Bench n/a · Vellum n/a · AA Index n/a
Adjusted score: 70.7 · Rank: #86 · Model: 66 · Confidence: 69%

Decision guide

How to choose from this Best AI for Coding ranking

Best for

  • Agentic coding workflows that edit files, run tests, and explain diffs.
  • Web development, refactoring, debugging, and framework-specific implementation tasks.
  • Teams comparing coding assistants before choosing an IDE plugin, API, or coding agent.

Evaluate

  • Run each finalist on a private repository test set with real issues and failing tests.
  • Compare diff quality, tool reliability, latency, price, context length, and rollback behavior.
  • Check whether the model follows security constraints and avoids changing unrelated files.
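The evaluation steps above can be sketched as a small scoring pass over per-task results. This Python example is only an illustration under stated assumptions: the `TaskResult` fields, the `summarize` helper, the placeholder model names, and the sample numbers are all hypothetical, and a real harness would also need to apply patches and run the repository's tests.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskResult:
    model: str                 # placeholder model label
    task_id: str               # issue or failing-test identifier in the private set
    tests_passed: bool         # did the repository's tests pass after the edit?
    latency_s: float           # wall-clock time for the run
    files_changed: frozenset   # files the model actually edited
    files_allowed: frozenset   # files the task was expected to touch


def summarize(results):
    """Aggregate pass rate, median latency, and out-of-scope edits per model."""
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r)
    summary = {}
    for model, rows in by_model.items():
        summary[model] = {
            "pass_rate": sum(r.tests_passed for r in rows) / len(rows),
            "median_latency_s": median([r.latency_s for r in rows]),
            # Tasks where the model edited files outside the allowed set.
            "out_of_scope_tasks": sum(1 for r in rows if r.files_changed - r.files_allowed),
        }
    return summary


# Made-up results for two placeholder models on two tasks.
results = [
    TaskResult("model-a", "issue-1", True, 40.0,
               frozenset({"app/db.py"}), frozenset({"app/db.py"})),
    TaskResult("model-a", "issue-2", False, 60.0,
               frozenset({"app/api.py", "README.md"}), frozenset({"app/api.py"})),
    TaskResult("model-b", "issue-1", True, 90.0,
               frozenset({"app/db.py"}), frozenset({"app/db.py"})),
    TaskResult("model-b", "issue-2", True, 110.0,
               frozenset({"app/api.py"}), frozenset({"app/api.py"})),
]
summary = summarize(results)
for model, stats in summary.items():
    print(model, stats)
```

Keeping pass rate, latency, and out-of-scope edits as separate columns, rather than folding them into one number, makes the trade-offs between finalists visible before price and context length are weighed in.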

Avoid

  • Letting any model make autonomous production changes without human review.
  • Choosing only by general chat quality when the task is code execution and repository repair.
  • Treating missing leaderboard rows as proof that a model is weak; missing rows mean uncertainty.

Questions

Best AI for Coding FAQ

What is the best AI for coding?

The top row is the blended pick for this snapshot, but the right coding model still depends on your repository, tool access, latency target, and budget.

Why are missing benchmark sources not scored as zero?

A missing source row is uncertainty, not a failed benchmark. The adjusted score blends the available source scores with a confidence weight that pulls low-evidence models toward a common baseline, instead of assuming zero performance.
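A common way to apply this kind of confidence weighting is to shrink each model's raw score toward a shared baseline, with less shrinkage as evidence coverage grows. The Python sketch below is an assumption about the general approach, not the report's published formula: the baseline of 81 and the weight of 0.686 are hypothetical values fitted by hand to the single-source rows in the ranking on this page.

```python
# Hypothetical parameters, fitted by hand to this page's single-source rows.
BASELINE = 81.0               # score that low-evidence models are pulled toward
SINGLE_SOURCE_WEIGHT = 0.686  # displayed on the page as roughly 69% confidence


def adjusted_score(raw: float, confidence: float, baseline: float = BASELINE) -> float:
    """Blend a raw score with the baseline; confidence 1.0 leaves it unchanged."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return round(baseline + confidence * (raw - baseline), 1)


print(adjusted_score(94, SINGLE_SOURCE_WEIGHT))  # single-source model, raw 94
print(adjusted_score(66, SINGLE_SOURCE_WEIGHT))  # single-source model, raw 66
print(adjusted_score(99.9, 1.0))                 # full coverage: no shrinkage
```

Under these assumed parameters a raw 94 becomes 89.9 and a raw 66 becomes 70.7, matching the corresponding rows in the ranking. The adjustment therefore moves low-evidence scores toward the middle in both directions rather than only lowering them.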

Should the highest coding score be used automatically?

No. Use the ranking as a shortlist, then test the top models on your own repo tasks before standardizing.

Method note

Treat the winner as a shortlist candidate, not a final procurement decision

The top model is the best blended pick for this snapshot, but model choice should still account for price, latency, privacy, context length, tool access, safety settings, and your own benchmark prompts. Use this page to reduce the search space, then run a small evaluation on your real tasks before standardizing. See the methodology and editorial policy for source selection and correction standards.