We Benchmarked 39 AI Models Through Our Payment Gateway. Here's What We Found. #123
March 16, 2026 | BlockRun Engineering
Last week we ran every model on BlockRun through a real-world latency benchmark — 39 models, same prompts, same payment pipeline, same hardware. No cherry-picked results. No synthetic lab conditions. Just cold, hard numbers from production infrastructure.
The results changed how we route requests.
Why We Did This
BlockRun is an x402 micropayment gateway that sits between your AI agent and 39+ LLM providers. Every request flows through our payment verification layer before hitting the model API. That means our latency numbers include everything a real user experiences: payment auth, provider API call, and response delivery.
Most benchmarks measure model speed in isolation. We wanted to measure what users actually feel.
The Leaderboard
We sent 2 coding prompts per model (256 max tokens, non-streaming) and measured end-to-end response time.
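The measurement itself is simple: wall-clock the whole request, payment auth included. Here is a minimal sketch of that harness; the prompts, the `call_model` callable, and its signature are illustrative stand-ins, not BlockRun's actual client API.

```python
import time
import statistics

# Illustrative prompts; the real benchmark used 2 coding prompts per model.
PROMPTS = [
    "Write a binary search function in Python.",
    "Reverse a singly linked list in C.",
]

def measure_latency(call_model, prompts, max_tokens=256):
    """Time each non-streaming request end to end; return the mean in ms.

    call_model is any callable that performs the full round trip:
    payment verification, provider API call, and response delivery.
    """
    samples = []
    for prompt in prompts:
        start = time.perf_counter()
        call_model(prompt, max_tokens)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples)
```

Using `time.perf_counter()` rather than `time.time()` matters here: it is monotonic and high-resolution, so a system clock adjustment mid-benchmark can't corrupt a sample.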
Speed Rankings (End-to-End Latency Through BlockRun)
Three Things That Surprised Us
1. xAI Grok is Absurdly Fast
Grok 4 Fast clocked in at 1,143ms end-to-end. That's the full round trip: payment verification, API call, response. For context, OpenAI's GPT-5.4 took 6,213ms for the same request — more than 5x slower.
The entire xAI lineup dominated the top of the leaderboard. Five of the top 10 fastest models are from xAI. At $0.20 per million input tokens, they're also among the cheapest.
2. Google Gemini Owns the Efficiency Frontier
Gemini 2.5 Flash delivered 1,238ms latency at $0.15/$0.60 per million tokens. For simple tasks, it's the clear winner on cost-per-quality.
But here's what's more impressive: Gemini 2.5 Pro came in at 1,294ms — barely slower than Flash — while scoring significantly higher on intelligence benchmarks. Google's infrastructure advantage is showing.
Six Google models landed in the top 13. No other provider came close to that kind of lineup depth.
3. OpenAI Flagship Models Are Surprisingly Slow
Every OpenAI model with "5.x" in the name landed in the bottom third of the leaderboard. GPT-5.3 Codex was dead last at 7,935ms. Even GPT-4o, a model from 2024, took over 5 seconds.
OpenAI's "mini" and "nano" variants are faster (2.2-3.2s range) but still 2x slower than the fastest competitors. The speed gap is real and consistent across their entire lineup.
Speed vs. Intelligence: The Tradeoff That Broke Our Routing
We cross-referenced our latency data with quality scores from Artificial Analysis (Intelligence Index v4.0).
Gemini 3.1 Pro is the standout: highest intelligence score (57) at just 1.6 seconds. GPT-5.4 matches its intelligence but takes 4x longer.
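A quick intelligence-per-second calculation, using only the figures quoted in this post, makes that gap concrete. This is a back-of-the-envelope sketch, not part of our routing algorithm:

```python
# Figures quoted above: (Intelligence Index v4.0 score, end-to-end latency in seconds).
models = {
    "Gemini 3.1 Pro": (57, 1.6),
    "GPT-5.4": (57, 6.2),
}

def intelligence_per_second(score, latency_s):
    """Crude efficiency metric: quality points delivered per second of waiting."""
    return score / latency_s

for name, (score, lat) in models.items():
    print(f"{name}: {intelligence_per_second(score, lat):.1f} points/s")
# Gemini 3.1 Pro delivers ~35.6 points/s vs ~9.2 for GPT-5.4: same score, ~4x faster.
```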
We initially used these numbers to promote fast models (Grok 4 Fast, Grok 4.1 Fast) as our default routing targets. It backfired. Users reported that the fast models were refusing complex tasks and giving shallow responses. Fast and cheap doesn't mean capable.
The fix: we now weight quality and user retention alongside speed in our routing algorithm. Gemini 2.5 Flash became our default for simple tasks (fast, cheap, reliable), while Kimi K2.5 handles medium-complexity work and Claude/GPT flagships handle the hard stuff.
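The tiering described above can be sketched as a simple lookup. This is an illustrative reduction: the model identifiers are hypothetical strings, and our production router additionally weighs quality scores and user-retention signals rather than complexity alone.

```python
def route(task_complexity: str) -> str:
    """Map a task-complexity tier to a default model (illustrative names)."""
    tiers = {
        "simple": "gemini-2.5-flash",   # fast, cheap, reliable
        "medium": "kimi-k2.5",          # mid-complexity work
        "hard": "claude-opus-4.6",      # flagship for reasoning-heavy tasks
    }
    # Unknown tiers fall back to the cheap default rather than a flagship.
    return tiers.get(task_complexity, "gemini-2.5-flash")
```

The fallback direction is deliberate: misclassifying a hard task as simple costs a retry, while defaulting everything to a flagship costs 5-7 seconds per call.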
What This Means for Developers
If you're building agents: Don't default to GPT. At 5-7 seconds per call, your agent's chain-of-actions will feel sluggish. Route simple subtasks to Grok/Gemini Flash and save the flagships for reasoning-heavy steps.
If you're cost-sensitive: Gemini 2.5 Flash-Lite at $0.10/$0.40 with 1.35s latency is the budget king. DeepSeek Chat at $0.27/$1.10 with 1.43s is a close second.
If you need peak intelligence: Gemini 3.1 Pro (IQ 57, 1.6s) gives you the same quality as GPT-5.4 (IQ 57, 6.2s) at one-quarter the latency and lower cost. Claude Opus 4.6 (IQ 53, 2.1s) is the best option if you need Anthropic-family capabilities.
If you want it all handled for you: That's what BlockRun's smart router does. Set your profile to auto and we'll pick the right model based on task complexity, balancing speed, quality, and cost automatically.
Methodology
Raw benchmark data: benchmark-results.json
BlockRun is the x402 micropayment gateway for AI. One wallet, 39+ models, pay-per-request with USDC. Get started