Your unbiased guide to the world's smartest AIs
One-click access to today's frontier leaders. Ranked by capability, updated weekly.
Codebase-scale agentic coding: parallel-subagent dynamic workflows migrate hundreds of thousands of lines from kickoff to merge. 88.6% on SWE-bench Verified, same price as 4.7.
Dominates reasoning and multimodal tasks. Scored 77% on ARC-AGI-2 — double its predecessor — and leads on graduate-level science benchmarks.
Multi-agent architecture with 1M context and built-in reasoning on every query. Now the cheapest frontier-quality option in the set — best for high-stakes research without the sugarcoating.
First fully retrained base since GPT-4.5. Leads agentic coding benchmarks with 82.7% on Terminal-Bench 2.0 and 73.1% on Expert-SWE. The enterprise default for autonomous multi-step work.
Purpose-built for sustained agentic execution and long-horizon coding workflows. 80.4% SWE-bench Verified and top-tier code arena slices at roughly half the price of closed flagships.
Elo ratings from Chatbot Arena blind votes. These shift weekly — here's the current snapshot.
No API key, no usage limits, no data leaving your machine. The best open-weight models ranked by Elo.
Google's Gemma 4 31B ranks #3 among all open-weight models on LMArena with an Elo of 1452 — beating models 20x its size. Fits a single RTX 4090 at Q4_K_M and ships Apache 2.0.
Alibaba's Qwen3 flagship dense model. Matches Qwen2.5-72B performance in a 32B package that fits a single RTX 4090, with hybrid thinking/non-thinking modes and Apache 2.0.
The strongest model that comfortably fits a single 24 GB GPU. Adds vision and tool-use over 3.1, tightly instruction-tuned, and Apache 2.0 with no usage restrictions — the best entry point for local AI.
No hype. Where each model actually leads, based on benchmarks and real-world usage as of today.
Highest Elo in the tracked set at 1503 after Fable 5 suspension. Strong on agentic coding and parallel subagent orchestration for large-scale engineering work.
88.6% on SWE-bench Verified and 69.2% on SWE-bench Pro. Leads for codebase-scale agentic coding with dynamic parallel subagent workflows; gains widen on harder, longer tasks.
Leads on PhD-level benchmarks like GPQA and ARC-AGI subsets. Claude and Grok are strong contenders.
Shines for maximally truthful, witty conversation. Great for brainstorming without corporate polish.
Weekly analysis, honest takes, and hidden gems. No engagement bait.
Just three days after public launch, a US export control order forced Anthropic to disable Fable 5 and Mythos 5 for every customer worldwide. A narrow reported jailbreak on vulnerability discovery triggered the blanket recall.
Fable 5 is the first Mythos-class model cleared for normal users, with big leads on long-horizon coding and research at double the prior price and with new safeguard fallbacks.
Qwen3.7-Max becomes the fifth tracked frontier model after clearing helloai's admission requirements. The first new-provider entrant brings strong agentic coding performance at roughly half the price of closed leaders.
The top of the leaderboard is now four flagships inside a 19-point Elo band, and Opus 4.8 added barely a point over its predecessor. Capability has converged; price and fit are the whole decision now.
Opus 4.8 ships 41 days after 4.7 at the same price, with a one-point SWE-bench bump and a new engine that fans a single task across hundreds of parallel subagents. The frontier race is moving from IQ to orchestration.
Only 15% of organizations are production-ready for agents, yet 41% are running them anyway. The bottleneck isn't model intelligence — it's the data and ops layer underneath.
Anthropic built Claude Mythos to autonomously find zero-day vulnerabilities, then refused to ship it publicly, gating access through the Project Glasswing consortium.
Gemini 3.5 Flash beats its own 3.1 Pro flagship on agentic and coding benchmarks at $1.50/$9 per million tokens — the most cost-competitive frontier-class model launched this week.
Z.ai's GLM-4.6 lands within ~60 Elo of the closed frontier and ships under a no-strings MIT license — the best open-weight model nobody in Western dev circles is discussing.
DeepSeek V4 ships under MIT with $0.30/M output tokens — 83x cheaper than Claude Opus 4.7 — while scoring 80.6% on SWE-bench Verified. The agentic-coding price floor just moved an order of magnitude.
Gemini 3.1 Pro scores 77.1% on ARC-AGI-2 — 24 points above GPT-5.5 — yet Arena Elo places it in a three-way tie. Here's what the leaderboard hides, and when the reasoning gap actually changes your routing decision.