
Fix HCS soft pool OOM: DMA directly from Marlin cache#17

Open
3spky5u-oss wants to merge 2 commits into brontoguana:main from 3spky5u-oss:fix/hcs-soft-pool-host-ram-oom

Conversation

@3spky5u-oss 3spky5u-oss commented Mar 31, 2026

Summary

  • HCS soft tier allocated per-chunk pinned host memory mirrors (~20+ GB) for batch DMA reload, duplicating data already in the page-locked Marlin cache
  • On Qwen3.5-122B with a 60 GB Marlin cache + 23 GB host mirrors, this exceeds RAM on 91 GB systems, causing OOM kills or RAM watchdog termination
  • Fix: DMA each expert directly from its source location in the already-pinned Marlin cache to the GPU soft slot, eliminating all redundant host memory
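The reload path described above can be sketched as follows. This is an illustrative Python sketch, not the actual implementation: `EXPERT_NBYTES` and `plan_direct_dma` are hypothetical names, and the sketch assumes the Marlin cache is one contiguous page-locked buffer indexed by expert id.

```python
# Hedged sketch of the direct-DMA reload plan. All names here are
# hypothetical; the real identifiers live in the HCS soft-pool code
# that this PR modifies.

EXPERT_NBYTES = 4 * 1024 * 1024  # assumed per-expert size in the Marlin cache

def plan_direct_dma(expert_ids, expert_nbytes=EXPERT_NBYTES):
    """For each expert, compute its byte offset inside the already
    page-locked Marlin cache and pair it with a GPU soft slot.

    Returns (slot_index, src_offset, nbytes) tuples; the caller would
    issue one async host-to-device copy per tuple, reading straight
    from the pinned cache instead of staging through a separate
    pinned host mirror.
    """
    return [
        (slot, eid * expert_nbytes, expert_nbytes)
        for slot, eid in enumerate(expert_ids)
    ]

# Example: three experts mapped to soft slots 0..2.
plan = plan_direct_dma([7, 3, 42])
```

Because the cache is already registered with `cuMemHostRegister`, each planned copy can be a single async host-to-device transfer with no intermediate host allocation.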

Benchmark (Qwen3.5-122B-A10B, RTX 5090, 91 GB RAM)

| Metric | Before (OOM) | After |
| --- | --- | --- |
| Status | OOM killed / RAM watchdog | Working |
| Decode (internal) | n/a | 45.75 tok/s |
| Round trip (network) | n/a | 52.76 tok/s |
| HCS coverage | n/a | 40.2% (4940/12288) |
| Host RAM overhead | ~23 GB | 0 GB |
| Soft pool load time | n/a | 0.92 s |

Test plan

  • Verified no OOM kill on 91 GB system with 122B model
  • Full benchmark suite passes (prefill + decode + network)
  • HCS soft pool loads to full VRAM capacity (40.2% coverage)
  • Soft pool reload after prefill VRAM reclaim works (sync + async paths)
  • Multi-GPU soft pool reload: not tested directly, but it uses the same pattern change

🤖 Generated with Claude Code

brontoguana and others added 2 commits March 30, 2026 19:37
The soft tier allocates pinned host memory mirrors for async DMA reload,
but never checks available host RAM. On systems where the Marlin expert
cache (60+ GB) already consumes most RAM, the additional 20+ GB of pinned
host mirrors pushes past system limits, triggering OOM kills or the RAM
watchdog (5% floor).

Add host RAM checks in two places:
1. Pre-loop cap: limits total soft slots based on available RAM minus
   8% safety margin (above the 5% RAM watchdog threshold)
2. Per-chunk check: stops allocation if a single chunk would breach
   the safety margin
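
The two checks described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the actual code: the function names are hypothetical, and only the 8% margin and the two-check structure come from the commit message.

```python
# Hedged sketch of the two host-RAM checks. The 8% safety margin
# (above the 5% RAM watchdog floor) is from the commit message;
# function names and signatures are hypothetical.

SAFETY_MARGIN = 0.08  # fraction of total RAM to keep free

def cap_soft_slots(avail_bytes, total_bytes, chunk_bytes,
                   wanted_slots, slots_per_chunk):
    """Pre-loop cap: limit total soft slots so that allocating their
    pinned host mirrors cannot eat into the safety margin."""
    floor = total_bytes * SAFETY_MARGIN
    budget = max(0, avail_bytes - floor)
    max_chunks = int(budget // chunk_bytes)
    return min(wanted_slots, max_chunks * slots_per_chunk)

def chunk_would_breach(avail_bytes, total_bytes, chunk_bytes):
    """Per-chunk check: stop allocation if pinning one more chunk
    would push free RAM below the safety margin."""
    return avail_bytes - chunk_bytes < total_bytes * SAFETY_MARGIN
```

In practice `avail_bytes` would be re-read from the OS before each chunk, since the Marlin cache and other allocations shift it during loading.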

Tested on Qwen3.5-122B-A10B with 91 GB RAM: previously OOM killed during
HCS pool loading, now successfully loads 2548 experts (20.7% coverage)
and benchmarks at 39.04 tok/s decode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The soft tier previously allocated per-chunk pinned host memory mirrors
(PinnedHostChunk) for batch DMA reload. On Qwen3.5-122B with a 60 GB
Marlin cache, these mirrors added 20+ GB of host RAM, causing OOM kills
on 91 GB systems.

Since the Marlin cache is already page-locked (cuMemHostRegister), we
can DMA each expert directly from its source location in the cache to
the GPU soft slot. This eliminates all redundant host memory while also
being faster (0.92s vs 1.67s for initial load).

Results on Qwen3.5-122B-A10B (RTX 5090, 91 GB RAM):
- Before: OOM killed during HCS pool loading
- After:  45.75 tok/s decode, 40.2% HCS coverage, 0 GB host overhead

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
