We Benchmarked 39 AI Models Through Our Payment Gateway. Here's What We Found. #123
March 16, 2026 | BlockRun Engineering
Last week we ran every model on BlockRun through a real-world latency benchmark — 39 models, same prompts, same payment pipeline, same hardware. No cherry-picked results. No synthetic lab conditions. Just cold, hard numbers from production infrastructure.
The results changed how we route requests.
Why We Did This
BlockRun is an x402 micropayment gateway that sits between your AI agent and 39+ LLM providers. Every request flows through our payment verification layer before hitting the model API. That means our latency numbers include everything a real user experiences: payment auth, provider API call, and response delivery.
Most benchmarks measure model speed in isolation. We wanted to measure what users actually feel.
The Leaderboard
We sent 2 coding prompts per model (256 max tokens, non-streaming) and measured end-to-end response time.
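The measurement itself is simple: wall-clock the whole request, payment auth included. Here is a minimal sketch of that harness; the prompts, the `call_model` callable, and its signature are illustrative stand-ins, not BlockRun's actual client API.

```python
import time
import statistics

# Illustrative prompts; the real benchmark used 2 coding prompts per model.
PROMPTS = [
    "Write a binary search function in Python.",
    "Reverse a singly linked list in C.",
]

def measure_latency(call_model, prompts, max_tokens=256):
    """Time each non-streaming request end to end; return the mean in ms.

    call_model is any callable that performs the full round trip:
    payment verification, provider API call, and response delivery.
    """
    samples = []
    for prompt in prompts:
        start = time.perf_counter()
        call_model(prompt, max_tokens)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples)
```

Using `time.perf_counter()` rather than `time.time()` matters here: it is monotonic and high-resolution, so a system clock adjustment mid-benchmark can't corrupt a sample.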
Speed Rankings (End-to-End Latency Through BlockRun)
Three Things That Surprised Us
1. xAI Grok is Absurdly Fast
Grok 4 Fast clocked in at 1,143ms end-to-end. That's the full round trip: payment verification, API call, response. For context, OpenAI's GPT-5.4 took 6,213ms for the same request — more than 5x slower.
The entire xAI lineup dominated the top of the leaderboard. Five of the top 10 fastest models are from xAI. At $0.20 per million input tokens, they're also among the cheapest.
2. Google Gemini Owns the Efficiency Frontier
Gemini 2.5 Flash delivered 1,238ms latency at $0.15/$0.60 per million tokens. For simple tasks, it's the clear winner on cost-per-quality.
But here's what's more impressive: Gemini 2.5 Pro came in at 1,294ms — barely slower than Flash — while scoring significantly higher on intelligence benchmarks. Google's infrastructure advantage is showing.
Six Google models landed in the top 13. No other provider came close to that kind of lineup depth.
3. OpenAI Flagship Models Are Surprisingly Slow
Every OpenAI model with "5.x" in the name landed in the bottom third of the leaderboard. GPT-5.3 Codex was dead last at 7,935ms. Even GPT-4o, a model from 2024, took over 5 seconds.
OpenAI's "mini" and "nano" variants are faster (2.2-3.2s range) but still 2x slower than the fastest competitors. The speed gap is real and consistent across their entire lineup.
Speed vs. Intelligence: The Tradeoff That Broke Our Routing
We cross-referenced our latency data with quality scores from Artificial Analysis (Intelligence Index v4.0).
Gemini 3.1 Pro is the standout: highest intelligence score (57) at just 1.6 seconds. GPT-5.4 matches its intelligence but takes 4x longer.
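A quick intelligence-per-second calculation, using only the figures quoted in this post, makes that gap concrete. This is a back-of-the-envelope sketch, not part of our routing algorithm:

```python
# Figures quoted above: (Intelligence Index v4.0 score, end-to-end latency in seconds).
models = {
    "Gemini 3.1 Pro": (57, 1.6),
    "GPT-5.4": (57, 6.2),
}

def intelligence_per_second(score, latency_s):
    """Crude efficiency metric: quality points delivered per second of waiting."""
    return score / latency_s

for name, (score, lat) in models.items():
    print(f"{name}: {intelligence_per_second(score, lat):.1f} points/s")
# Gemini 3.1 Pro delivers ~35.6 points/s vs ~9.2 for GPT-5.4: same score, ~4x faster.
```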
We initially used these numbers to promote fast models (Grok 4 Fast, Grok 4.1 Fast) as our default routing targets. It backfired. Users reported that the fast models were refusing complex tasks and giving shallow responses. Fast and cheap doesn't mean capable.
The fix: we now weight quality and user retention alongside speed in our routing algorithm. Gemini 2.5 Flash became our default for simple tasks (fast, cheap, reliable), while Kimi K2.5 handles medium-complexity work and Claude/GPT flagships handle the hard stuff.
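The tiering described above can be sketched as a simple lookup. This is an illustrative reduction: the model identifiers are hypothetical strings, and our production router additionally weighs quality scores and user-retention signals rather than complexity alone.

```python
def route(task_complexity: str) -> str:
    """Map a task-complexity tier to a default model (illustrative names)."""
    tiers = {
        "simple": "gemini-2.5-flash",   # fast, cheap, reliable
        "medium": "kimi-k2.5",          # mid-complexity work
        "hard": "claude-opus-4.6",      # flagship for reasoning-heavy tasks
    }
    # Unknown tiers fall back to the cheap default rather than a flagship.
    return tiers.get(task_complexity, "gemini-2.5-flash")
```

The fallback direction is deliberate: misclassifying a hard task as simple costs a retry, while defaulting everything to a flagship costs 5-7 seconds per call.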
What This Means for Developers
If you're building agents: Don't default to GPT. At 5-7 seconds per call, your agent's chain-of-actions will feel sluggish. Route simple subtasks to Grok/Gemini Flash and save the flagships for reasoning-heavy steps.
If you're cost-sensitive: Gemini 2.5 Flash-Lite at $0.10/$0.40 with 1.35s latency is the budget king. DeepSeek Chat at $0.27/$1.10 with 1.43s is a close second.
If you need peak intelligence: Gemini 3.1 Pro (IQ 57, 1.6s) gives you the same quality as GPT-5.4 (IQ 57, 6.2s) at one-quarter the latency and lower cost. Claude Opus 4.6 (IQ 53, 2.1s) is the best option if you need Anthropic-family capabilities.
If you want it all handled for you: That's what BlockRun's smart router does. Set your profile to auto and we'll pick the right model based on task complexity, balancing speed, quality, and cost automatically.
Methodology
Raw benchmark data: benchmark-results.json
BlockRun is the x402 micropayment gateway for AI. One wallet, 39+ models, pay-per-request with USDC. Get started