Hey MLX community,
I'm building ANF (Autonomous Native Forge), a cloud-free, four-agent autonomous software production pipeline.
It currently runs on an NVIDIA Blackwell GB10 with vLLM + DeepSeek-R1-32B, and I'm now porting it to Apple Silicon.
The agents talk to the inference backend over an OpenAI-compatible API (`/v1/chat/completions`): four agents hit the endpoint concurrently with long-context requests (up to 32K tokens).
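For concreteness, the fan-out looks roughly like the sketch below. The endpoint URL, port, model id, agent roles, and prompt are illustrative assumptions on my part, not taken from the ANF repo; the point is just that four clients post standard chat-completion payloads to one local server at the same time.

```python
# Rough sketch of the ANF fan-out: four agents posting concurrent
# chat-completion requests to one OpenAI-compatible endpoint.
# ENDPOINT, MODEL, and the agent role names are illustrative assumptions.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server
MODEL = "deepseek-r1-32b"                               # placeholder model id


def build_request(agent: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Chat-completions payload shared by all agents."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": f"You are the {agent} agent."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
    }


def complete(payload: dict) -> str:
    """POST one payload and return the first choice's message text."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]


def fan_out(agents: list[str]) -> list[str]:
    """Issue one request per agent concurrently, one worker per agent."""
    payloads = [build_request(a, "Work on the current task.") for a in agents]
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(complete, payloads))


# Example (requires a running server):
#   fan_out(["planner", "coder", "tester", "reviewer"])
```

So the stability question below is really: can the MLX-side server sustain this kind of 4-way concurrent, long-context load the way vLLM's continuous batching does?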
Three questions:
1. mlx-lm ships an OpenAI-compatible server: how stable is it under sustained concurrent requests on an M4 Ultra?
2. Are there known limitations with 32K context on MLX compared to vLLM?
3. What's the recommended model format for DeepSeek-R1 on Apple Silicon: MLX-converted weights, or GGUF via the llama.cpp Metal backend?
Repo for context:
github.com/trgysvc/AutonomousNativeForge
Thanks