
Fix HCS soft pool OOM: DMA directly from Marlin cache#17

Open
3spky5u-oss wants to merge 2 commits into brontoguana:main from 3spky5u-oss:fix/hcs-soft-pool-host-ram-oom

Conversation

@3spky5u-oss 3spky5u-oss commented Mar 31, 2026

Summary

  • HCS soft tier allocated per-chunk pinned host memory mirrors (~20+ GB) for batch DMA reload, duplicating data already in the page-locked Marlin cache
  • On Qwen3.5-122B with a 60 GB Marlin cache + 23 GB host mirrors, this exceeds RAM on 91 GB systems, causing OOM kills or RAM watchdog termination
  • Fix: DMA each expert directly from its source location in the already-pinned Marlin cache to the GPU soft slot, eliminating all redundant host memory
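The reload path described above can be sketched as follows. This is an illustrative Python sketch, not the actual implementation: `EXPERT_NBYTES` and `plan_direct_dma` are hypothetical names, and the sketch assumes the Marlin cache is one contiguous page-locked buffer indexed by expert id.

```python
# Hedged sketch of the direct-DMA reload plan. All names here are
# hypothetical; the real identifiers live in the HCS soft-pool code
# that this PR modifies.

EXPERT_NBYTES = 4 * 1024 * 1024  # assumed per-expert size in the Marlin cache

def plan_direct_dma(expert_ids, expert_nbytes=EXPERT_NBYTES):
    """For each expert, compute its byte offset inside the already
    page-locked Marlin cache and pair it with a GPU soft slot.

    Returns (slot_index, src_offset, nbytes) tuples; the caller would
    issue one async host-to-device copy per tuple, reading straight
    from the pinned cache instead of staging through a separate
    pinned host mirror.
    """
    return [
        (slot, eid * expert_nbytes, expert_nbytes)
        for slot, eid in enumerate(expert_ids)
    ]

# Example: three experts mapped to soft slots 0..2.
plan = plan_direct_dma([7, 3, 42])
```

Because the cache is already registered with `cuMemHostRegister`, each planned copy can be a single async host-to-device transfer with no intermediate host allocation.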

Benchmark (Qwen3.5-122B-A10B, RTX 5090, 91 GB RAM)

| Metric | Before (OOM) | After |
| --- | --- | --- |
| Status | OOM killed / RAM watchdog | Working |
| Decode (internal) | n/a | 45.75 tok/s |
| Round trip (network) | n/a | 52.76 tok/s |
| HCS coverage | n/a | 40.2% (4940/12288) |
| Host RAM overhead | ~23 GB | 0 GB |
| Soft pool load time | n/a | 0.92 s |

Test plan

  • Verified no OOM kill on 91 GB system with 122B model
  • Full benchmark suite passes (prefill + decode + network)
  • HCS soft pool loads to full VRAM capacity (40.2% coverage)
  • Soft pool reload after prefill VRAM reclaim works (sync + async paths)
  • Multi-GPU soft pool reload: not tested directly, but it uses the same pattern change

🤖 Generated with Claude Code

brontoguana and others added 2 commits March 30, 2026 19:37
The soft tier allocates pinned host memory mirrors for async DMA reload,
but never checks available host RAM. On systems where the Marlin expert
cache (60+ GB) already consumes most RAM, the additional 20+ GB of pinned
host mirrors pushes past system limits, triggering OOM kills or the RAM
watchdog (5% floor).

Add host RAM checks in two places:
1. Pre-loop cap: limits total soft slots based on available RAM minus
   8% safety margin (above the 5% RAM watchdog threshold)
2. Per-chunk check: stops allocation if a single chunk would breach
   the safety margin
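
The two checks described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the actual code: the function names are hypothetical, and only the 8% margin and the two-check structure come from the commit message.

```python
# Hedged sketch of the two host-RAM checks. The 8% safety margin
# (above the 5% RAM watchdog floor) is from the commit message;
# function names and signatures are hypothetical.

SAFETY_MARGIN = 0.08  # fraction of total RAM to keep free

def cap_soft_slots(avail_bytes, total_bytes, chunk_bytes,
                   wanted_slots, slots_per_chunk):
    """Pre-loop cap: limit total soft slots so that allocating their
    pinned host mirrors cannot eat into the safety margin."""
    floor = total_bytes * SAFETY_MARGIN
    budget = max(0, avail_bytes - floor)
    max_chunks = int(budget // chunk_bytes)
    return min(wanted_slots, max_chunks * slots_per_chunk)

def chunk_would_breach(avail_bytes, total_bytes, chunk_bytes):
    """Per-chunk check: stop allocation if pinning one more chunk
    would push free RAM below the safety margin."""
    return avail_bytes - chunk_bytes < total_bytes * SAFETY_MARGIN
```

In practice `avail_bytes` would be re-read from the OS before each chunk, since the Marlin cache and other allocations shift it during loading.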

Tested on Qwen3.5-122B-A10B with 91 GB RAM: previously OOM killed during
HCS pool loading, now successfully loads 2548 experts (20.7% coverage)
and benchmarks at 39.04 tok/s decode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The soft tier previously allocated per-chunk pinned host memory mirrors
(PinnedHostChunk) for batch DMA reload. On Qwen3.5-122B with a 60 GB
Marlin cache, these mirrors added 20+ GB of host RAM, causing OOM kills
on 91 GB systems.

Since the Marlin cache is already page-locked (cuMemHostRegister), we
can DMA each expert directly from its source location in the cache to
the GPU soft slot. This eliminates all redundant host memory while also
being faster (0.92s vs 1.67s for initial load).

Results on Qwen3.5-122B-A10B (RTX 5090, 91 GB RAM):
- Before: OOM killed during HCS pool loading
- After:  45.75 tok/s decode, 40.2% HCS coverage, 0 GB host overhead

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
