Review2 minMay 22, 2026

Gemini 3.5 Flash: Faster, Cheaper, and Beating 3.1 Pro

Gemini 3.5 Flash beats its own 3.1 Pro flagship on agentic and coding benchmarks at $1.50/$9 per million tokens — the most cost-competitive frontier-class model launched this week.

Google's cheaper model just beat its expensive one where it counts. Gemini 3.5 Flash, launched at I/O on May 19, scores 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas — both above Gemini 3.1 Pro on the same evals. For developers picking a model for agentic and coding workloads, that inverts the usual logic: the budget tier is now the better tool, and it's generally available today, including inside GitHub Copilot.

The pricing makes the comparison harder to argue with. Flash runs $1.50 input and $9.00 output per million tokens, 25% under 3.1 Pro's $2/$12 on both sides. You pay less and get higher agentic and terminal-task scores. On Arena Elo it lands at 1503, within two points of Claude Opus 4.7 — at roughly a third of the price, per RD World Online. The cost-per-capability curve for frontier-class agents shifted this week.

The crown does not transfer cleanly, though. Gemini 3.1 Pro still leads on pure reasoning: 77.1% on ARC-AGI-2 against Flash's 72.1%, and 44.4% on Humanity's Last Exam against 40.2%. If your workload leans on hard, multi-step abstract reasoning rather than tool use and code execution, Pro remains the stronger pick. Flash wins the benchmarks that map to building agents; it does not win the ones that map to solving novel puzzles.

That split is the whole decision. Most production agent work — running terminal commands, chaining tool calls through MCP, editing code — lives closer to Terminal-Bench and MCP Atlas than to ARC-AGI-2. For those tasks, paying the Pro premium now buys you a reasoning edge you may never exercise. The 1,048,576-token input window and 65,536-token output ceiling, with a January 2026 knowledge cutoff, hold up for long-context agent loops too.

The open question is what happens next month. Gemini 3.5 Pro is reportedly in internal testing and expected soon. If Pro follows the same arc Flash just traced — beating the previous flagship at a lower price — the reasoning advantage that still justifies 3.1 Pro could evaporate within weeks. Anyone standardizing an agent stack on a single model right now should treat that choice as a 30-day commitment, not a year-long one.

← More from Hello, AI