
fix(scheduler): CHIMERE_SEQUENTIAL_DECODE=1 fallback for M>=3 race #3

Open

AIdevsmartdata wants to merge 1 commit into main from fix/native-sequential-decode-fallback

Conversation

@AIdevsmartdata
Owner

Context

NativeScheduler M=4 + heterogeneous prompts triggers a CUDA illegal memory access in ik_llama's mul_mat_q path within 3-9 s. Investigation on 2026-04-26 narrowed the boundary:

| Config | Result |
| --- | --- |
| Native M=2 + varied prompts | ✅ STABLE (30/30) |
| Native M=3+ + varied prompts | ❌ CRASH within 5-10 s |
| J2 closure M=4 | ✅ STABLE (50/50) |
| CHIMERE_LAUNCH_BLOCKING=1 | ✅ no crash (sync masks the race) |
| compute-sanitizer memcheck | ✅ no error (instrumentation linearizes) |

→ async race in ik_llama's multi-seq llama_decode path. Fanning out to mono-seq decodes reduces the crash rate by ~95%.

Patch

tick_generate_all gains an env-gated fallback:

  • CHIMERE_SEQUENTIAL_DECODE=1 → for each Generating slot, issue a 1-entry forward_multi_seq, then sample and emit, sequentially. Mirrors J2 behaviour (see the sketch after this list).
  • Default (unset): unchanged multi-seq packed batch.
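
A minimal sketch of the gating and fan-out shape. `tick_generate_all`, `forward_multi_seq`, and the env variable name come from this PR; everything else (`Slot`, `SlotState`, `Batch`, the stubbed decode and greedy `sample`) is a hypothetical placeholder, not the chimere-server API:

```rust
use std::sync::OnceLock;

// Hypothetical stand-in types; the real chimere-server definitions differ.
#[derive(PartialEq)]
enum SlotState { Idle, Prefill, Generating }

struct Slot {
    seq_id: i32,
    pos: i32,
    last_token: i32,
    state: SlotState,
}

struct Batch {
    entries: Vec<(i32, i32, i32)>, // (seq_id, token, pos)
}

impl Batch {
    fn single(seq_id: i32, token: i32, pos: i32) -> Self {
        Batch { entries: vec![(seq_id, token, pos)] }
    }
}

// Placeholder for the FFI call into ik_llama's llama_decode; returns dummy logits.
fn forward_multi_seq(_batch: &Batch) -> Result<Vec<f32>, String> {
    Ok(vec![0.0; 32])
}

// Greedy argmax stand-in for the real per-slot sampler.
fn sample(logits: &[f32]) -> i32 {
    logits.iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as i32)
        .unwrap_or(0)
}

/// CHIMERE_SEQUENTIAL_DECODE=1 is read once, at the first tick.
fn sequential_decode_enabled() -> bool {
    static FLAG: OnceLock<bool> = OnceLock::new();
    *FLAG.get_or_init(|| {
        std::env::var("CHIMERE_SEQUENTIAL_DECODE").map(|v| v == "1").unwrap_or(false)
    })
}

fn tick_generate_all(slots: &mut [Slot]) -> Result<(), String> {
    if sequential_decode_enabled() {
        // Fallback: one 1-entry batch per Generating slot, issued sequentially
        // (mirrors the J2 closure scheduler).
        for slot in slots.iter_mut().filter(|s| s.state == SlotState::Generating) {
            let batch = Batch::single(slot.seq_id, slot.last_token, slot.pos);
            let logits = forward_multi_seq(&batch)?;
            let token = sample(&logits);
            slot.last_token = token;
            slot.pos += 1;
            // emit `token` to the slot's output stream here
        }
        return Ok(());
    }

    // Default (env unset): one packed multi-seq batch covering every Generating slot.
    let packed = Batch {
        entries: slots.iter()
            .filter(|s| s.state == SlotState::Generating)
            .map(|s| (s.seq_id, s.last_token, s.pos))
            .collect(),
    };
    forward_multi_seq(&packed).map(|_| ())
}
```

Reading the flag once via OnceLock keeps the per-token hot loop free of repeated getenv calls; switching back to the packed path only requires restarting with the variable unset.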

Empirical results

  • Without flag: crash at T+5 s
  • With flag: crash at T+127 s (20× improvement, ~95% mitigation)
  • For 100% stability, fall back to J2 (CHIMERE_MULTISLOT_NATIVE unset).

Trade-off

~5-10% aggregate throughput loss vs packed multi-seq decode, in exchange for 20× crash-rate reduction.

Companion

ik_llama.cpp branch fix/multi-seq-quantize-ne2 (separate PR upstream needed) corrects 3 latent OOB write/read sites in the quantize_mmq_q8_1_cuda dispatch — different bugs caught during the same investigation (regressions from PR #1373 DRY refactor). Those fixes alone don't resolve the M>=3 race.

Companion docs

See memory note chimere-m4-prompt-diversity-crash-2026-04-25.md for full investigation timeline and bisection methodology.

🤖 Generated with Claude Code

…itigation

Adds an env-gated fallback in NativeScheduler::tick_generate_all that fans
the multi-seq decode batch into N sequential mono-seq forward_multi_seq
calls (one per Generating slot), mirroring the J2 closure scheduler.

Empirical investigation 2026-04-26 demonstrated:
  - Native M=4 + heterogeneous prompts → CUDA illegal memory access in
    `quantize_mmq_q8_1<D4>` (template MoE expert routing path) within 3-9 s
  - Native M=2 + heterogeneous prompts: STABLE (30/30 OK)
  - Native M=3+ + heterogeneous prompts: CRASH within ~8 s
  - J2 closure (mono-seq, sequential) M=4: STABLE (50/50 OK)
  - CHIMERE_SEQUENTIAL_DECODE=1 (this patch): crash time T+5 s → T+127 s,
    a 20× reduction in failure rate (~95% mitigation).

The remaining 5% race likely lives in slot lifecycle transitions
(prefill→decode, slot reuse) and needs Nsight Systems trace to pinpoint.

Wiring:
  - tick_generate_all (slot_scheduler.rs:1900) gates on env at first call
  - Sequential path: for each Generating slot, build a 1-entry batch,
    invoke forward_multi_seq, sample, emit, repeat
  - Default path (env unset): unchanged multi-seq packed batch

Trade-off: ~5–10% aggregate throughput loss vs packed multi-seq decode,
in exchange for crash-rate reduction ≈20×. Recommended setting until
ik_llama llama_decode multi-seq race is fully patched.

For complete stability prefer J2 (CHIMERE_MULTISLOT_NATIVE unset).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AIdevsmartdata
Owner Author

Status update 2026-04-26 — superseded by chimere-server NativeScheduler M=4 + FA-off workaround.

The crash root cause has been traced to a Flash Attention bug in ik_llama.cpp/ggml/src/ggml-cuda/fattn-common.cuh:783-808: launch_fattn dequantizes K/V via a 1D-linear to_fp16(K, K_f16, 1, ggml_nelements(K), stream), ignoring the multi-seq view strides and reading out of bounds.
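
For intuition only, a conceptual Rust sketch of that failure mode: the `StridedView` type and the buffer layout below are invented for illustration and are not ggml's data structures; the point is that walking a strided view as a flat run of nelements values touches memory the view does not own.

```rust
/// Conceptual stand-in for a tensor view: strides are in elements.
struct StridedView<'a> {
    data: &'a [f32],     // backing buffer (may be much larger than the view)
    offset: usize,       // start of the view inside `data`
    shape: [usize; 2],   // [rows, cols] of the view
    strides: [usize; 2], // element distance between rows / columns
}

impl<'a> StridedView<'a> {
    /// Stride-aware read: stays inside the elements the view actually owns.
    fn get(&self, r: usize, c: usize) -> f32 {
        self.data[self.offset + r * self.strides[0] + c * self.strides[1]]
    }
}

fn main() {
    // A 16-element buffer holding two interleaved sequences of 2x4 rows each:
    // seq0-row0, seq1-row0, seq0-row1, seq1-row1.
    let buffer: Vec<f32> = (0..16).map(|i| i as f32).collect();
    // View of sequence 1 only: its rows sit 8 elements apart in the buffer.
    let view = StridedView { data: &buffer, offset: 4, shape: [2, 4], strides: [8, 1] };

    // Correct, stride-aware traversal: 8 elements, all belonging to seq 1.
    let strided: Vec<f32> = (0..view.shape[0])
        .flat_map(|r| (0..view.shape[1]).map(move |c| (r, c)))
        .map(|(r, c)| view.get(r, c))
        .collect();

    // 1D-linear traversal in the spirit of to_fp16(..., nelements(K)):
    // assumes the 8 view elements are contiguous starting at the offset.
    // Here it silently picks up seq-0 data; with a larger offset it would
    // index past the end of the buffer (analogous to the OOB described above).
    let nelements = view.shape.iter().product::<usize>();
    let linear: Vec<f32> = buffer[view.offset..view.offset + nelements].to_vec();

    assert_ne!(strided, linear); // the two traversals disagree
}
```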

Production currently runs NativeScheduler M=4 with CHIMERE_FLASH_ATTN=0 + CHIMERE_KV_TYPE_V=1 + CHIMERE_KV_HADAMARD=0 — 200/200 hetero stress-test passes, stable over 535 s of sustained load.

This Rust-side CHIMERE_SEQUENTIAL_DECODE=1 fallback is no longer needed. Two paths forward:

  1. Option A (in progress) — stride-aware dequant patch in launch_fattn to bypass the 1D-linear bug → restore FA-on multi-seq.
  2. Option B (WIP) — cuDNN frontend SDPA backend on aidev/ik_llama.cpp:feat/cudnn-fa-backend-wip for Blackwell sm_120.

Recommendation: keep this PR open as a documented safety-net path until either Option A or B lands and is benchmarked stable.

Full investigation recap: ~/Bureau/chimere-debug-2026-04-26/RECAP-COMPLET-2026-04-25-26.md.

