fix(scheduler): CHIMERE_SEQUENTIAL_DECODE=1 fallback for M>=3 race#3
AIdevsmartdata wants to merge 1 commit into main
…itigation
Adds an env-gated fallback in NativeScheduler::tick_generate_all that fans
the multi-seq decode batch into N sequential mono-seq forward_multi_seq
calls (one per Generating slot), mirroring the J2 closure scheduler.
Empirical investigation 2026-04-26 demonstrated:
- Native M=4 + heterogeneous prompts → CUDA illegal memory access in
`quantize_mmq_q8_1<D4>` (template MoE expert routing path) within 3-9 s
- Native M=2 + heterogeneous prompts: STABLE (30/30 OK)
- Native M=3+ + heterogeneous prompts: CRASH within ~8 s
- J2 closure (mono-seq, sequential) M=4: STABLE (50/50 OK)
- CHIMERE_SEQUENTIAL_DECODE=1 (this patch): time to crash moves from
T+5 s to T+127 s, roughly a 20× reduction in failure rate (~95% mitigation).
The remaining ~5% of failures likely stem from a race in slot lifecycle
transitions (prefill→decode, slot reuse); an Nsight Systems trace is needed
to pinpoint it.
Wiring:
- tick_generate_all (slot_scheduler.rs:1900) gates on env at first call
- Sequential path: for each Generating slot, build a 1-entry batch,
invoke forward_multi_seq, sample, emit, repeat
- Default path (env unset): unchanged multi-seq packed batch
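
The wiring above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in slot_scheduler.rs: the `Slot`, `SlotState` and `BatchEntry` types, the helper names, and the rule that only the exact value "1" enables the flag are all assumptions made for the example. The `forward` closure stands in for forward_multi_seq plus the sample/emit steps.

```rust
use std::sync::OnceLock;

/// Hypothetical slot state; only `Generating` slots take part in decode.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SlotState {
    Idle,
    Prefilling,
    Generating,
}

/// Hypothetical batch row: one sequence id plus its last sampled token.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct BatchEntry {
    seq_id: usize,
    token: u32,
}

/// Hypothetical slot record, reduced to what the sketch needs.
struct Slot {
    seq_id: usize,
    state: SlotState,
    last_token: u32,
}

/// Assumption: only the exact value "1" turns the fallback on.
fn flag_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Reads CHIMERE_SEQUENTIAL_DECODE on the first call and caches the answer,
/// so later changes to the environment are not observed.
fn sequential_decode_enabled() -> bool {
    static FLAG: OnceLock<bool> = OnceLock::new();
    *FLAG.get_or_init(|| {
        flag_enabled(std::env::var("CHIMERE_SEQUENTIAL_DECODE").ok().as_deref())
    })
}

/// Plans the batches for one tick: a single packed batch by default,
/// or one 1-entry batch per Generating slot when `sequential` is set.
fn plan_decode_batches(slots: &[Slot], sequential: bool) -> Vec<Vec<BatchEntry>> {
    let entries: Vec<BatchEntry> = slots
        .iter()
        .filter(|s| s.state == SlotState::Generating)
        .map(|s| BatchEntry { seq_id: s.seq_id, token: s.last_token })
        .collect();
    if entries.is_empty() {
        return Vec::new();
    }
    if sequential {
        entries.into_iter().map(|e| vec![e]).collect()
    } else {
        vec![entries]
    }
}

/// Runs one tick: invokes `forward` once per planned batch, in slot order,
/// and returns how many forward calls were made.
fn tick_generate_all_sketch<F: FnMut(&[BatchEntry])>(
    slots: &[Slot],
    sequential: bool,
    mut forward: F,
) -> usize {
    let batches = plan_decode_batches(slots, sequential);
    for batch in &batches {
        forward(batch);
    }
    batches.len()
}
```

A caller would pass `sequential_decode_enabled()` as the `sequential` argument. With four Generating slots this turns one packed forward call per tick into four 1-entry calls.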
Trade-off: ~5–10% aggregate throughput loss vs packed multi-seq decode,
in exchange for a ≈20× crash-rate reduction. Recommended setting until
the ik_llama llama_decode multi-seq race is fully patched.
For complete stability prefer J2 (CHIMERE_MULTISLOT_NATIVE unset).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Status update 2026-04-26: superseded by […]. The crash root cause has been traced to a Flash Attention bug in […]. Production currently runs NativeScheduler M=4 with […]. This Rust-side […]
Recommendation: keep this PR open as a documented safety-net path until either Option A or B lands and is benchmarked stable. Full investigation recap: […]
Context
NativeScheduler M=4 + heterogeneous prompts triggers a CUDA illegal memory access in ik_llama's `mul_mat_q` path within 3-9 s. Investigation 2026-04-26 narrowed the boundary: `CHIMERE_LAUNCH_BLOCKING=1` → async race in ik_llama's multi-seq `llama_decode` path. ~95% reduction by fanning out to mono-seq decodes.
Patch
`tick_generate_all` gains an env-gated fallback: `CHIMERE_SEQUENTIAL_DECODE=1` → for each Generating slot, issue a 1-entry `forward_multi_seq` (sample, emit) sequentially. Mirrors J2 behaviour.
Empirical results
[…] `CHIMERE_MULTISLOT_NATIVE` unset).
Trade-off
~5-10% aggregate throughput loss vs packed multi-seq decode, in exchange for 20× crash-rate reduction.
Companion
ik_llama.cpp branch `fix/multi-seq-quantize-ne2` (separate PR upstream needed) corrects 3 latent OOB write/read sites in the quantize_mmq_q8_1_cuda dispatch. These are different bugs caught during the same investigation (regressions from PR #1373 DRY refactor). Those fixes alone don't resolve the M>=3 race.
Companion docs
See memory note `chimere-m4-prompt-diversity-crash-2026-04-25.md` for full investigation timeline and bisection methodology.
🤖 Generated with Claude Code