Analysis2 minJul 4, 2026

How Caching and Batching Cut Frontier API Costs by 90%

helloai's leaderboard shows nominal per-token rates. Prompt caching and batch APIs stack underneath — turning a $5/MTok flagship into $0.25 on repeated context.

helloai's models.json lists nominal per-million-token rates: Opus 4.8 at $5/$25, GPT-5.5 at $5/$30, Gemini 3.1 Pro at $2/$12. Those numbers are real — and misleading for any production workload that repeats system prompts, tool schemas, or document context across thousands of calls. Every major frontier provider now ships prompt caching and a batch API underneath the headline price card. Stack them correctly and effective input cost drops by an order of magnitude without switching models.

Anthropic's published multipliers make the math explicit. A cache read costs 0.1× the base input rate — 90% off. Opus 4.8's $5/MTok base becomes $0.50/MTok on a cache hit. The Batch API halves both input and output: $2.50/$12.50 for the same model. Anthropic documents that caching multipliers stack with batch pricing. A cached input token submitted through the Batch API on Opus 4.8 lands at $0.25/MTok — 95% below the rate helloai displays. The trade-off is latency: batch jobs run asynchronously, and cache writes cost 1.25× base on first storage.

OpenAI and Google mirror the pattern with different labels. OpenAI enables prompt caching automatically on prompts of 1,024 tokens or longer, advertising up to 90% input savings and 80% latency reduction; GPT-5.5's batch tier lists $2.50 input versus $5 standard — a straight 50% cut. Gemini 3.1 Pro Preview drops from $2/$12 to $1/$6 on the Batch API for prompts under 200k tokens, with context caching at $0.20/MTok versus $2.00 standard — another 90% reduction on stored context. helloai tracks Grok 4.3 as the cheapest nominal frontier at $1.25/$2.50, but xAI's public docs list no equivalent caching or batch tier today. For agentic workloads with fat system prompts, Grok's headline rate advantage narrows once competitors serve repeat context at ten cents on the dollar.

The engineering implication is structural, not cosmetic. Agent harnesses — Claude Code, Codex, custom tool loops — resend the same instructions, MCP tool definitions, and repository context on every turn. That is precisely the traffic pattern caching was built for: static prefix first, variable user message last. Teams billing against nominal leaderboard rates are often paying five to twenty times what optimized routing would cost. The /api/recommend endpoint weights cost efficiency at 15% of its score using those nominal figures; a workload-aware cost model would reorder several matchups, especially where Gemini's $2 headline rate competes with Opus after discounts.

helloai will keep publishing nominal rates because they are comparable across providers and stable week to week. But the leaderboard row is a ceiling, not a quote. Before the next model swap, audit what fraction of your tokens are cache-eligible and which requests can tolerate batch latency. The frontier intelligence race gets the headlines; the caching-and-batching stack underneath is where most teams will actually win or lose on margin.

← More from Hello, AI