Analysis · 2 min · Jul 4, 2026

Frontier Models Score Under 1% on ARC-AGI-3

ARC-AGI-3 launched March 25 with humans at 100% and frontier AI at 0.51%. GPT-5.5 and Opus 4.7 barely move the needle — exposing a gap arena Elo cannot see.

On March 25, ARC Prize Foundation shipped ARC-AGI-3, the first interactive reasoning benchmark where agents explore novel game environments with no instructions, infer the rules, and carry what they learn across levels. The launch headline was blunt: humans score 100%, frontier AI scores 0.51%. That gap is not a rounding error. The benchmark measures whether models can adapt in real time to an environment nobody described in training data, a skill none of helloai's tracked frontier entries have demonstrated.

The gap holds when you zoom in on models helloai actually tracks. ARC Prize's May analysis of 160 replays put GPT-5.5 at 0.43% and Claude Opus 4.7 at 0.18% on the semi-private dataset — both under one percent on 135 hand-crafted environments where controlled human testing confirmed every level is solvable. Over one million games have been played since launch. Arena Elo tells a different story: helloai's six-model set spans just 33 points from Fable 5 at 1508 to Qwen3.7-Max at 1475. ARC-AGI-3 suggests those ranks measure a narrower skill than interactive adaptation.
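To put the 33-point spread in perspective, here is a minimal sketch that converts a rating gap into an expected head-to-head score. It assumes the arena ratings follow the standard Elo logistic curve with a 400-point scale; helloai's arena may use a different rating model, so treat the function name and the scale parameter as illustrative rather than as the site's actual method.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    Assumption: a 400-point scale, as in classic Elo. This is not confirmed
    to be the formula behind helloai's arena ratings.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the article for the top and bottom of the six-model set.
top, bottom = 1508, 1475

spread = top - bottom
p_top = elo_expected_score(top, bottom)

print(f"spread: {spread} points")
print(f"expected score for the top model: {p_top:.3f}")
```

Under that assumption, the top-rated model would be expected to take roughly 55% of head-to-head comparisons against the bottom-rated one, which is close to a coin flip and consistent with the article's point that the arena ranks barely separate these models.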

The replays explain the flat scores. ARC Prize identified three failure modes: models perceive local effects but fail to build global world models; they map unfamiliar mechanics to known games like Tetris or Sokoban and waste actions on wrong affordances; and they sometimes clear an early level by coincidence, then carry a wrong theory forward. Opus compresses into confident-but-wrong theories; GPT-5.5 generates wider hypotheses but struggles to commit. Both land under one percent.

For developers using helloai's /api/recommend endpoint, the implication is task-shape dependent. ARC-AGI-3 is not a critique of coding on SWE-bench or reasoning on ARC-AGI-2's static puzzles — those scores remain valid for those workloads. Interactive adaptation — exploring a new UI, inferring undocumented tool behavior, recovering when environments change mid-run — is a different axis. None of the tracked frontier models have demonstrated it at human efficiency on ARC-AGI-3's action-scoring framework.

ARC Prize 2026 is live with over two million dollars in prizes for open-source ARC-AGI-3 solutions. helloai will keep tracking arena rank, pricing, and category fit for workloads builders ship today. But the Elo convergence that felt definitive in May looks thinner when models leading coding and reasoning leaderboards cannot crack one percent on environments any human can learn in a single sitting. The next capability worth watching is not another SWE-bench point — it is whether any frontier model can close the exploration gap without instructions.
