Your unbiased guide to the world's smartest AIs
One-click access to today's frontier leaders. Ranked by capability, updated weekly.
First Mythos-class model cleared for general use. 95% SWE-bench Verified, leads GDPval-AA at 1932 Elo. Sharp gains on long-horizon agentic work; larger leads the harder the task. $10/$50, 1M context. Safeguards fallback to Opus 4.8 on <5% of cyber/bio/distillation sessions.
Codebase-scale agentic coding: parallel-subagent dynamic workflows migrate hundreds of thousands of lines from kickoff to merge. 88.6% on SWE-bench Verified, same price as 4.7.
Dominates reasoning and multimodal tasks. Scored 77% on ARC-AGI-2 — double its predecessor — and leads on graduate-level science benchmarks.
Multi-agent architecture with 1M context and built-in reasoning on every query. Now the cheapest frontier-quality option in the set — best for high-stakes research without the sugarcoating.
First fully retrained base since GPT-4.5. Leads agentic coding benchmarks with 82.7% on Terminal-Bench 2.0 and 73.1% on Expert-SWE. The enterprise default for autonomous multi-step work.
Purpose-built for sustained agentic execution and long-horizon coding workflows. 80.4% SWE-bench Verified and top-tier code arena slices at roughly half the price of closed flagships.
Elo ratings from Chatbot Arena blind votes. These shift weekly — here's the current snapshot.
No API key, no usage limits, no data leaving your machine. The best open-weight models ranked by Elo.
Google's Gemma 4 31B ranks #3 among all open-weight models on LMArena with an Elo of 1452 — beating models 20x its size. Fits a single RTX 4090 at Q4_K_M and ships Apache 2.0.
Alibaba's Qwen3 flagship dense model. Matches Qwen2.5-72B performance in a 32B package that fits a single RTX 4090, with hybrid thinking/non-thinking modes and Apache 2.0.
The strongest model that comfortably fits a single 24 GB GPU. Adds vision and tool-use over 3.1, tightly instruction-tuned, and Apache 2.0 with no usage restrictions — the best entry point for local AI.
No hype. Where each model actually leads, based on benchmarks and real-world usage as of today.
Mythos-class public debut at 1525 Elo. Fable 5 leads capability evals including 1932 GDPval-AA (Opus 4.8: 1890). First time this tier is available to normal users.
95% on SWE-bench Verified and 80% on SWE-bench Pro. Clear step-change on long-horizon agentic coding and complex workflows; bigger gains as tasks get harder.
Leads on PhD-level benchmarks like GPQA and ARC-AGI subsets. Claude and Grok are strong contenders.
Shines for maximally truthful, witty conversation. Great for brainstorming without corporate polish.
Weekly analysis, honest takes, and hidden gems. No engagement bait.
Fable 5 is the first Mythos-class model cleared for normal users, with big leads on long-horizon coding and research at double the prior price and with new safeguard fallbacks.
Qwen3.7-Max becomes the fifth tracked frontier model after clearing helloai's admission requirements. The first new-provider entrant brings strong agentic coding performance at roughly half the price of closed leaders.
The top of the leaderboard is now four flagships inside a 19-point Elo band, and Opus 4.8 added barely a point over its predecessor. Capability has converged; price and fit are the whole decision now.
Opus 4.8 ships 41 days after 4.7 at the same price, with a one-point SWE-bench bump and a new engine that fans a single task across hundreds of parallel subagents. The frontier race is moving from IQ to orchestration.
Only 15% of organizations are production-ready for agents, yet 41% are running them anyway. The bottleneck isn't model intelligence — it's the data and ops layer underneath.
Anthropic built Claude Mythos to autonomously find zero-day vulnerabilities, then refused to ship it publicly, gating access through the Project Glasswing consortium.
Gemini 3.5 Flash beats its own 3.1 Pro flagship on agentic and coding benchmarks at $1.50/$9 per million tokens — the most cost-competitive frontier-class model launched this week.
Z.ai's GLM-4.6 lands within ~60 Elo of the closed frontier and ships under a no-strings MIT license — the best open-weight model nobody in Western dev circles is discussing.
DeepSeek V4 ships under MIT with $0.30/M output tokens — 83x cheaper than Claude Opus 4.7 — while scoring 80.6% on SWE-bench Verified. The agentic-coding price floor just moved an order of magnitude.
Gemini 3.1 Pro scores 77.1% on ARC-AGI-2 — 24 points above GPT-5.5 — yet Arena Elo places it in a three-way tie. Here's what the leaderboard hides, and when the reasoning gap actually changes your routing decision.