Best Model for Coding Right Now
The tracked frontier spans 28 Elo points on coding, but input prices vary by 4x and output prices by 12x. Here is how helloai's scoring picks a winner for engineering work, and when to trade rank for cost.
If you ask helloai's /api/recommend endpoint for coding today, Claude Opus 4.8 wins — but not by the margin the leaderboard alone would suggest. The tracked set spans just 28 Elo points from Opus 4.8 at 1503 down to Qwen3.7-Max at 1475, yet input pricing runs from $1.25 per million tokens for Grok 4.3 to $5 for Opus 4.8 and GPT-5.5. Output spreads are wider still: $2.50 for Grok against $30 for GPT-5.5. On coding tasks the recommendation engine weights category fit at 40 percent, Elo at 35 percent, cost efficiency at 15 percent, and context window at 10 percent. Opus 4.8 is the Coding and Engineering category leader, posts the highest Elo in the set, and carries the benchmark story — 88.6 percent on SWE-bench Verified and 69.2 percent on SWE-bench Pro — so it tops the stack when you want the default answer with no budget constraint.
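The weighting described above can be sketched as a plain weighted sum. This is a minimal illustration, not helloai's actual implementation: only the four weights come from the text, while the function name, the assumption that each component is already normalized to the range 0 to 1, and the two sample inputs are hypothetical.

```python
# Weights for coding tasks, as stated in the article.
WEIGHTS = {"category_fit": 0.40, "elo": 0.35, "cost": 0.15, "context": 0.10}


def coding_score(category_fit, elo, cost, context):
    """Weighted sum of four components.

    Assumption: every component is already normalized to [0, 1], where 1 is
    best (for cost, 1 means cheapest). How helloai normalizes each component
    is not documented here, so that step is deliberately left out.
    """
    parts = {
        "category_fit": category_fit,
        "elo": elo,
        "cost": cost,
        "context": context,
    }
    for name, value in parts.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


# Hypothetical inputs, chosen only to show how the weights interact.
# A category leader with the top Elo but the worst cost score:
leader = coding_score(category_fit=1.0, elo=1.0, cost=0.0, context=0.5)
# A non-specialist with the lowest Elo but the best cost score:
bargain = coding_score(category_fit=0.5, elo=0.0, cost=1.0, context=0.5)

print(f"{leader:.2f}")   # 0.80
print(f"{bargain:.2f}")  # 0.40
```

With category fit and Elo together carrying 75 percent of the score, a perfect cost score alone cannot lift a model past the category leader. That matches the ranking described above when no budget constraint is applied.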
The interesting split is what happens when cost enters the picture. Qwen3.7-Max scores second for coding in the same formula: it lists Coding and Engineering among its strengths, sits 28 Elo points below Opus at 1475, and prices at roughly $2.50 input and $7.50 output, which is half the input cost of the closed flagships and a quarter of GPT-5.5's output rate. Its SWE-bench Verified score of about 80.4 percent is real, not marketing rounding, and it has documented 35-hour multi-agent runs. For teams running agent loops that burn output tokens on tool results and retries, that price gap compounds faster than a 28-point Elo deficit. Grok 4.3 ranks lower on pure coding fit (it leads Honest Daily Use, not engineering), but at $1.25/$2.50 it is the cheapest way to stay inside the tracked frontier. That makes it the right pick for high-volume codegen drafts, lint fixes, and test scaffolding where you will review every diff anyway.
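To see how the price gap compounds, here is a small cost calculation using the per-million-token prices quoted in this article. The token volumes (2 million input and 1 million output per agent run) are a hypothetical workload picked for round numbers, not a measurement. Opus 4.8 is left out because its output price is not given here.

```python
# (input, output) prices in USD per million tokens, as quoted in the article.
PRICES_PER_MTOK = {
    "Grok 4.3": (1.25, 2.50),
    "Qwen3.7-Max": (2.50, 7.50),  # the article gives these as approximate
    "GPT-5.5": (5.00, 30.00),
}


def run_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one run for the given token counts."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical agent run: 2M input tokens and 1M output tokens.
for model in PRICES_PER_MTOK:
    print(f"{model}: ${run_cost(model, 2_000_000, 1_000_000):.2f}")
# Grok 4.3: $5.00
# Qwen3.7-Max: $12.50
# GPT-5.5: $40.00
```

Under this assumed mix, a single run on GPT-5.5 costs 3.2 times as much as on Qwen3.7-Max and 8 times as much as on Grok 4.3. The more output-heavy the loop, the closer the gap moves toward the 12x output-price spread.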
GPT-5.5 occupies an awkward middle in this lineup. It posts strong agentic numbers (82.7 percent on Terminal-Bench 2.0 and 73.1 percent on Expert-SWE) and matches Opus on input price at $5 per million tokens, but its $30 output rate is the highest in the set. The recommendation engine penalizes that on cost-weighted tasks unless you filter it in explicitly for provider reasons. Gemini 3.1 Pro is not a coding specialist in our taxonomy; it leads Hard Reasoning and Science. Unless your pipeline is multimodal or reasoning-heavy inside the same repo, it will not top a coding query even at competitive $2/$12 pricing.
Practical routing looks like this. Reach for Opus 4.8 when the job is codebase-scale: migrations, parallel subagent workflows, or SWE-bench Pro-class problems where the hardest evals are the point. Switch to Qwen3.7-Max when you need sustained agentic execution at scale and can accept slightly lower arena rank for half the bill. Use Grok 4.3 as a cost firewall for iterative work that humans gate before merge. Reserve GPT-5.5 for terminal-native agent stacks where Terminal-Bench-shaped tool chains are the bottleneck and output cost is a rounding error. None of these choices requires chasing a bigger Elo number; the frontier is already a near-tie.
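The routing rules above fit in a small lookup table. This is a sketch only: the workload names and the `pick_model` helper are hypothetical labels for the four cases in the paragraph, not part of any helloai API.

```python
from enum import Enum


class Workload(Enum):
    """Hypothetical labels for the four workload shapes described above."""
    CODEBASE_SCALE = "codebase_scale"          # migrations, parallel subagents
    SUSTAINED_AGENTIC = "sustained_agentic"    # long agent runs at scale
    HUMAN_GATED_DRAFTS = "human_gated_drafts"  # iterative work reviewed before merge
    TERMINAL_AGENT = "terminal_agent"          # terminal-native tool chains


# Mapping taken directly from the routing paragraph.
ROUTES = {
    Workload.CODEBASE_SCALE: "Claude Opus 4.8",
    Workload.SUSTAINED_AGENTIC: "Qwen3.7-Max",
    Workload.HUMAN_GATED_DRAFTS: "Grok 4.3",
    Workload.TERMINAL_AGENT: "GPT-5.5",
}


def pick_model(workload: Workload) -> str:
    """Return the model this article suggests for a given workload shape."""
    return ROUTES[workload]


print(pick_model(Workload.HUMAN_GATED_DRAFTS))  # Grok 4.3
```

A static table like this is easy to audit and to override per team. It does not take budget or context limits into account, so it complements a call to the recommendation endpoint rather than replacing it.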
The larger lesson is that coding procurement in mid-2026 has decoupled from the leaderboard. Nineteen to twenty-eight Elo points separate the models developers actually ship, while prices diverge by up to 12x on output. The honest workflow is to call /api/recommend?task=coding with your real budget and context constraints, then override only when you know your pipeline has a shape the generic weights underweight, such as terminal agents, multimodal inputs, or subagent orchestration. The best model for coding is not the one with the highest score. It is the one whose price and behavior match how your agents actually run overnight.