Claude Fable 5
Anthropic · Proprietary API · Best fit
Highest-confidence coding and autonomous engineering tasks.
Full evidence: 5/5 sources · 100% confidence
Adjusted score
99.9
Model
99.9
Confidence
100%
This ranking covers agentic coding, web development, repository fixes, and production engineering tasks. The report blends public leaderboard signals into one task-specific composite score, then shows the best-fit use cases, evidence coverage, and decision context behind each ranked model.
Use this coding ranking to shortlist AI models for repository edits, agentic coding, bug fixes, code review, and web application generation.
Last updated: June 16, 2026
Page value
Repository fixes and agentic coding shortlist.
Data basis
5 public sources · 86 models
Ranking snapshot
2026-06-16
All ranked models
Showing 86 models with at least one source score. Rows are ordered by Bayesian-smoothed adjusted score; missing source rows stay n/a instead of counting as zero.
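The n/a handling described above can be sketched as follows. This is a minimal illustration, not the site's published method: equal source weights, the function names, and the example models and scores are all assumptions, and the Bayesian smoothing step is left out so that only the missing-source behavior is shown.

```python
from statistics import mean
from typing import Optional, Sequence


def blend_sources(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Average the sources that reported a score; None marks an n/a source."""
    present = [s for s in scores if s is not None]
    # Assumption: equal source weights. A model with no scores stays n/a (None)
    # rather than being treated as zero.
    return mean(present) if present else None


def evidence_coverage(scores: Sequence[Optional[float]]) -> float:
    """Fraction of sources that reported a score, e.g. 1 of 5 -> 0.2."""
    return sum(s is not None for s in scores) / len(scores)


def rank(rows: dict[str, Sequence[Optional[float]]]) -> list[str]:
    """Order model names by blended score, highest first.

    Models without any source score are left out of the listing.
    """
    blended = {name: blend_sources(s) for name, s in rows.items()}
    scored = {name: b for name, b in blended.items() if b is not None}
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical models and scores, for illustration only.
rows = {
    "model-a": [90.0, None, None, None, None],
    "model-b": [80.0, 84.0, 82.0, 86.0, 83.0],
    "model-c": [None, None, None, None, None],
}
print(rank(rows))
```

In this sketch a single strong source can outrank a model with full coverage, which is the case the confidence adjustment on this page is meant to temper.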
Best fit
Highest-confidence coding and autonomous engineering tasks.
Full evidence: 5/5 sources · 100% confidence
Adjusted score
99.9
Model
99.9
Confidence
100%
Best fit
Large code changes where review quality matters.
Full evidence: 5/5 sources · 100% confidence
Adjusted score
92.7
Model
92.7
Confidence
100%
Best fit
High-end coding when extended reasoning mode is useful.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
89.9
Model
94
Confidence
69%
Best fit
Large code changes that benefit from slower thinking-mode review.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
89.9
Model
94
Confidence
69%
Best fit
Stable high-end coding when Fable or Opus 4.8 is unavailable.
Full evidence: 5/5 sources · 100% confidence
Adjusted score
89.9
Model
89.9
Confidence
100%
Best fit
Extended reasoning on large repositories and multi-file refactors.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
89.2
Model
93
Confidence
69%
Best fit
Premium coding assistance with strong instruction following.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
88.5
Model
92
Confidence
69%
Best fit
Cost-aware coding workflows where Alibaba model access is preferred.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
88.5
Model
92
Confidence
69%
Best fit
Open-weight friendly coding experiments and self-hosted evaluations.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
88.5
Model
92
Confidence
69%
Best fit
Daily coding edits, review, and lower-cost agentic workflows.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
87.9
Model
91
Confidence
69%
Best fit
Alternative coding assistant testing across web development prompts.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
87.9
Model
91
Confidence
69%
Best fit
Long-context code reading and open-weight oriented comparison.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
87.9
Model
91
Confidence
69%
Best fit
Experimental web development generation and UI coding prompts.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
87.9
Model
91
Confidence
69%
Best fit
Highest-effort OpenAI coding harness runs and difficult repository tasks.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
87.2
Model
90
Confidence
69%
Best fit
Coding plus broad product, data, and tool-use work.
Full evidence: 5/5 sources · 100% confidence
Adjusted score
86.7
Model
86.7
Confidence
100%
Best fit
Careful multi-step coding where thinking-mode behavior is preferred.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
86.5
Model
89
Confidence
69%
Best fit
Preview-model coding tests before standardizing on Qwen releases.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
86.5
Model
89
Confidence
69%
Best fit
Higher-effort OpenAI coding runs with strong tool-use potential.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
86.5
Model
89
Confidence
69%
Best fit
Open-weight coding model tests and lower-cost deployment planning.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
85.8
Model
88
Confidence
69%
Best fit
High-quality coding and review when newer Opus versions are unavailable.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
85.8
Model
88
Confidence
69%
Best fit
Balanced Qwen coding workloads and product prototyping.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
85.8
Model
88
Confidence
69%
Best fit
Reasoning-heavy coding comparisons with open-weight deployment options.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
85.8
Model
88
Confidence
69%
Best fit
High-effort coding assistance and tool-call workflows.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
85.8
Model
88
Confidence
69%
Best fit
Google ecosystem coding tests and preview-model evaluation.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
85.1
Model
87
Confidence
69%
Best fit
Open-weight coding and local evaluation candidates.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
84.4
Model
86
Confidence
69%
Best fit
General coding, research, and Google-integrated development workflows.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
84.4
Model
86
Confidence
69%
Best fit
Balanced OpenAI coding runs where latency and cost matter.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
84.4
Model
86
Confidence
69%
Best fit
Fast web development assistance and lightweight coding workflows.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
84.4
Model
86
Confidence
69%
Best fit
Open-weight coding shortlist testing.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
84.4
Model
86
Confidence
69%
Best fit
Open-weight coding model evaluation with Z.ai releases.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
84.4
Model
86
Confidence
69%
Best fit
Xiaomi coding model comparison for web development tasks.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
84.4
Model
86
Confidence
69%
Best fit
Reasoning-forward coding and long-context repository prompts.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
84.4
Model
86
Confidence
69%
Best fit
Fast coding assistance and lower-latency product workflows.
Full evidence: 5/5 sources · 100% confidence
Adjusted score
84.3
Model
84.3
Confidence
100%
Best fit
Faster Kimi coding runs and draft implementation work.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
83.7
Model
85
Confidence
69%
Best fit
Codex-style coding workflows and repository automation.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
83.7
Model
85
Confidence
69%
Best fit
General OpenAI coding support where newer variants are unavailable.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
83.1
Model
84
Confidence
69%
Best fit
Smaller high-effort OpenAI coding tasks.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
83.1
Model
84
Confidence
69%
Best fit
Open-ish MiniMax coding model tests and web UI generation.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
83.1
Model
84
Confidence
69%
Best fit
Open-source Qwen coding evaluation and self-hosted experiments.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
83.1
Model
84
Confidence
69%
Best fit
Baseline OpenAI coding assistance.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
83.1
Model
84
Confidence
69%
Best fit
General OpenAI coding and product engineering tasks.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
83.1
Model
84
Confidence
69%
Best fit
Preview-model web development comparison.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
83.1
Model
84
Confidence
69%
Best fit
Older OpenAI coding benchmark comparison.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
83.1
Model
84
Confidence
69%
Best fit
Reasoning-heavy xAI coding tests and alternative model comparison.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
82.4
Model
83
Confidence
69%
Best fit
Thinking-mode coding and careful edit planning.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
82.4
Model
83
Confidence
69%
Best fit
Low-latency Gemini coding with minimal thinking behavior.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
82.4
Model
83
Confidence
69%
Best fit
Legacy Opus-class coding and review comparisons.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
82.4
Model
83
Confidence
69%
Best fit
Routine coding, bug fixes, and code review.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
82.4
Model
83
Confidence
69%
Best fit
MiniMax coding model evaluation on web tasks.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
82.4
Model
83
Confidence
69%
Best fit
Open-source local or hosted coding experiments.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
82.4
Model
83
Confidence
69%
Best fit
Codex-style coding workflows and repository automation at the lower Arena score band.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
81.7
Model
82
Confidence
69%
Best fit
xAI coding assistant evaluation.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
81.7
Model
82
Confidence
69%
Best fit
Reasoning-oriented open-weight coding tests.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
81.7
Model
82
Confidence
69%
Best fit
Smaller Qwen self-hosted coding experiments.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
81.7
Model
82
Confidence
69%
Best fit
Tencent preview-model coding comparison.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
81.7
Model
82
Confidence
69%
Best fit
Smaller open-source coding and local deployment tests.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
81.7
Model
82
Confidence
69%
Best fit
Lower-resource Qwen coding experiments.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
81.7
Model
82
Confidence
69%
Best fit
Open-weight coding model comparison.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
81
Model
81
Confidence
69%
Best fit
Legacy OpenAI coding benchmark reference.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
80.3
Model
80
Confidence
69%
Best fit
Fast open-weight Xiaomi coding tasks.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
80.3
Model
80
Confidence
69%
Best fit
Codex-style repository automation on older OpenAI releases.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
80.3
Model
80
Confidence
69%
Best fit
Open-weight coding and local agent comparisons.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
80.3
Model
80
Confidence
69%
Best fit
Fast reasoning-focused Kimi coding workflows.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
80.3
Model
80
Confidence
69%
Best fit
Older Codex-style coding workflows.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
80.3
Model
80
Confidence
69%
Best fit
Fast, lower-cost coding assistance and code review drafts.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
80.3
Model
80
Confidence
69%
Best fit
Open-source MiniMax coding baseline comparisons.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
78.9
Model
78
Confidence
69%
Best fit
Fast Xiaomi coding model tests with thinking behavior.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
78.9
Model
78
Confidence
69%
Best fit
Experimental DeepSeek coding model comparison.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
78.3
Model
77
Confidence
69%
Best fit
Coder-specialized Qwen deployment and benchmark testing.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
78.3
Model
77
Confidence
69%
Best fit
Mistral coding benchmark comparison and EU-provider shortlist work.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
77.6
Model
76
Confidence
69%
Best fit
Coder-specialized comparison across web development tasks.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
77.6
Model
76
Confidence
69%
Best fit
Smaller Qwen coding deployments and local evaluation.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
76.9
Model
75
Confidence
69%
Best fit
Low-latency Gemini coding assistant comparison.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
76.9
Model
75
Confidence
69%
Best fit
Open-source thinking-model coding tests.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
76.9
Model
75
Confidence
69%
Best fit
Lower-cost Codex-style coding support.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
76.2
Model
74
Confidence
69%
Best fit
Fast Qwen coding assistance and draft implementation.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
76.2
Model
74
Confidence
69%
Best fit
Fast xAI reasoning on coding tasks.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
76.2
Model
74
Confidence
69%
Best fit
Mistral coding and reasoning comparison.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
75.5
Model
73
Confidence
69%
Best fit
xAI thinking-mode coding tests.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
75.5
Model
73
Confidence
69%
Best fit
Legacy Gemini coding benchmark reference.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
74.8
Model
72
Confidence
69%
Best fit
Small open-source coding model experiments.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
74.8
Model
72
Confidence
69%
Best fit
Developer-focused Mistral coding benchmark comparison.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
74.8
Model
72
Confidence
69%
Best fit
Alternative model coding evaluation.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
73.5
Model
70
Confidence
69%
Best fit
Fast Grok reasoning for coding prompts.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
72.8
Model
69
Confidence
69%
Best fit
Code-specialized Grok assistant comparison.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
72.1
Model
68
Confidence
69%
Best fit
Lower-ranked Mistral developer-model baseline.
Low evidence: 1/5 sources · 69% confidence
Adjusted score
70.7
Model
66
Confidence
69%
Decision guide
Questions
The top row is the blended pick for this snapshot, but the right coding model still depends on your repository, tool access, latency target, and budget.
A missing source row is uncertainty, not a failed benchmark. The adjusted score blends available quality signals with a confidence penalty instead of assuming zero performance.
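One common way to apply a confidence penalty of this kind is shrinkage toward a prior: the model score is weighted by the confidence and the remainder is filled with a baseline. The sketch below assumes that form. The linear formula, the function name, and the prior of 81 are assumptions inferred from the displayed rows (they land within about 0.1 of the listed adjusted scores), not a formula stated by the source.

```python
def adjusted_score(model_score: float, confidence: float, prior: float = 81.0) -> float:
    """Shrink a model score toward a prior in proportion to missing confidence.

    Assumed form: adjusted = confidence * model + (1 - confidence) * prior.
    The default prior of 81.0 is an inferred, hypothetical baseline.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return confidence * model_score + (1.0 - confidence) * prior


# Full evidence: the score passes through unchanged.
print(round(adjusted_score(99.9, 1.00), 1))

# Low evidence (1/5 sources, 69% confidence): scores are pulled toward the prior,
# so a high score moves down and a low score moves up.
print(round(adjusted_score(94.0, 0.69), 2))
print(round(adjusted_score(66.0, 0.69), 2))
```

Under this assumption a model with one source is never scored as zero on the missing four. It is simply trusted less, which is why low-evidence rows above the baseline show an adjusted score below their model score and rows below the baseline show one above it.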
The ranking does not replace your own testing. Use it as a shortlist, then test the top models on your own repo tasks before standardizing.
Method note
The top model is the best blended pick for this query snapshot, but model choice should still account for price, latency, privacy, context length, tool access, safety settings, and your own benchmark prompts. Use this page to reduce the search space, then run a small evaluation on your real tasks before standardizing. See the methodology and editorial policy for source selection and correction standards.