Methodology

How AI Jupyter builds weighted AI model rankings

Scoring process

  1. Choose a search intent, such as "best AI for coding" or "best AI for image generation".
  2. Select public sources that are relevant to that task rather than relying on one generic chart.
  3. Normalize each source score onto a 0 to 100 scale for that category snapshot.
  4. Apply source weights based on task fit, recency, measurement quality, and coverage.
  5. Apply a small coverage adjustment when a model is missing from a source.
  6. Publish the result as an editorial snapshot with links to the underlying sources.
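
Steps 3 to 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the exact AI Jupyter implementation: the page does not specify the normalization formula, the weight values, or the size of the coverage adjustment. The sketch assumes min-max normalization, a weighted average over the sources that list a model, and a flat per-source penalty (missing_penalty, a hypothetical parameter). The source names, model names, and numbers are made up.

```python
from typing import Dict, Optional


def normalize(raw: Dict[str, float]) -> Dict[str, float]:
    """Rescale one source's raw scores onto 0..100 (assumed: min-max scaling)."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # All models tied on this source: give every model the same score.
        return {model: 100.0 for model in raw}
    return {model: 100.0 * (value - lo) / (hi - lo) for model, value in raw.items()}


def weighted_score(
    model: str,
    normalized: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    missing_penalty: float = 2.0,  # hypothetical value for the "small coverage adjustment"
) -> Optional[float]:
    """Weighted average over the sources that list the model, minus a coverage penalty."""
    total = 0.0
    weight_sum = 0.0
    missing = 0
    for source, weight in weights.items():
        score = normalized[source].get(model)
        if score is None:
            missing += 1  # model absent from this source
            continue
        total += weight * score
        weight_sum += weight
    if weight_sum == 0:
        return None  # model appears in no source, so it cannot be ranked
    return max(0.0, total / weight_sum - missing_penalty * missing)


# Made-up raw scores from two hypothetical sources on different scales.
raw_scores = {
    "source_a": {"model_x": 1200.0, "model_y": 1000.0, "model_z": 1150.0},
    "source_b": {"model_x": 60.0, "model_y": 80.0},  # model_z is missing here
}
# Made-up weights standing in for task fit, recency, quality, and coverage.
weights = {"source_a": 0.6, "source_b": 0.4}

normalized = {source: normalize(scores) for source, scores in raw_scores.items()}
models = sorted({m for scores in raw_scores.values() for m in scores})
ranking = sorted(
    ((m, weighted_score(m, normalized, weights)) for m in models),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.1f}")
```

Dividing by the sum of the weights actually used keeps a model from being scored as zero on a source that simply does not list it. The separate penalty still nudges sparsely covered models down, which is one plausible reading of the coverage adjustment in step 5.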

This method is designed for useful comparison, not scientific certainty. Real-world results can differ because prompts, safety settings, reasoning effort, latency, price, context length, and tool access all affect model performance.

Source categories

Coding

  • Arena.ai Code Arena
  • Vals AI SWE-bench
  • Vals AI Vibe Code Bench
  • Vellum SWE-bench rankings
  • Artificial Analysis Intelligence Index

Writing and essays

  • Arena.ai Creative Writing
  • Surge AI Hemingway-bench
  • EQ-Bench Creative Writing
  • EQ-Bench Longform Writing
  • Arena.ai Text Overall

Math

  • Vals AI ProofBench
  • Surge AI Riemann-bench
  • Vellum AIME rankings
  • Artificial Analysis AIME 2025
  • Artificial Analysis Intelligence Index

Image generation

  • Arena.ai Text-to-Image Arena
  • Artificial Analysis Text-to-Image leaderboard