
fix(scheduler): CHIMERE_SEQUENTIAL_DECODE=1 fallback for M>=3 race #3

Open

AIdevsmartdata wants to merge 1 commit into main from fix/native-sequential-decode-fallback

Conversation

@AIdevsmartdata
Owner

Context

NativeScheduler M=4 + heterogeneous prompts triggers a CUDA illegal memory access in ik_llama's mul_mat_q path within 3-9 s. Investigation on 2026-04-26 narrowed the boundary:

| Config | Result |
| --- | --- |
| Native M=2 + varied prompts | ✅ STABLE (30/30) |
| Native M=3+ + varied prompts | ❌ CRASH within 5-10 s |
| J2 closure M=4 | ✅ STABLE (50/50) |
| CHIMERE_LAUNCH_BLOCKING=1 | ✅ no crash (sync masks the race) |
| compute-sanitizer memcheck | ✅ no error (instrumentation linearizes) |

→ async race in ik_llama's multi-seq llama_decode path. Fanning out to mono-seq decodes reduces the crash rate by ~95%.

Patch

tick_generate_all gains an env-gated fallback:

  • CHIMERE_SEQUENTIAL_DECODE=1 → for each Generating slot, issue a 1-entry forward_multi_seq, then sample and emit, sequentially. Mirrors J2 behaviour (see the sketch after this list).
  • Default (unset): unchanged multi-seq packed batch.
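
A minimal sketch of the gating and fan-out shape. `tick_generate_all`, `forward_multi_seq`, and the env variable name come from this PR; everything else (`Slot`, `SlotState`, `Batch`, the stubbed decode and greedy `sample`) is a hypothetical placeholder, not the chimere-server API:

```rust
use std::sync::OnceLock;

// Hypothetical stand-in types; the real chimere-server definitions differ.
#[derive(PartialEq)]
enum SlotState { Idle, Prefill, Generating }

struct Slot {
    seq_id: i32,
    pos: i32,
    last_token: i32,
    state: SlotState,
}

struct Batch {
    entries: Vec<(i32, i32, i32)>, // (seq_id, token, pos)
}

impl Batch {
    fn single(seq_id: i32, token: i32, pos: i32) -> Self {
        Batch { entries: vec![(seq_id, token, pos)] }
    }
}

// Placeholder for the FFI call into ik_llama's llama_decode; returns dummy logits.
fn forward_multi_seq(_batch: &Batch) -> Result<Vec<f32>, String> {
    Ok(vec![0.0; 32])
}

// Greedy argmax stand-in for the real per-slot sampler.
fn sample(logits: &[f32]) -> i32 {
    logits.iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as i32)
        .unwrap_or(0)
}

/// CHIMERE_SEQUENTIAL_DECODE=1 is read once, at the first tick.
fn sequential_decode_enabled() -> bool {
    static FLAG: OnceLock<bool> = OnceLock::new();
    *FLAG.get_or_init(|| {
        std::env::var("CHIMERE_SEQUENTIAL_DECODE").map(|v| v == "1").unwrap_or(false)
    })
}

fn tick_generate_all(slots: &mut [Slot]) -> Result<(), String> {
    if sequential_decode_enabled() {
        // Fallback: one 1-entry batch per Generating slot, issued sequentially
        // (mirrors the J2 closure scheduler).
        for slot in slots.iter_mut().filter(|s| s.state == SlotState::Generating) {
            let batch = Batch::single(slot.seq_id, slot.last_token, slot.pos);
            let logits = forward_multi_seq(&batch)?;
            let token = sample(&logits);
            slot.last_token = token;
            slot.pos += 1;
            // emit `token` to the slot's output stream here
        }
        return Ok(());
    }

    // Default (env unset): one packed multi-seq batch covering every Generating slot.
    let packed = Batch {
        entries: slots.iter()
            .filter(|s| s.state == SlotState::Generating)
            .map(|s| (s.seq_id, s.last_token, s.pos))
            .collect(),
    };
    forward_multi_seq(&packed).map(|_| ())
}
```

Reading the flag once via OnceLock keeps the per-token hot loop free of repeated getenv calls; switching back to the packed path only requires restarting with the variable unset.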

Empirical results

  • Without flag: crash at T+5 s
  • With flag: crash at T+127 s (20× improvement, ~95% mitigation)
  • For 100% stability, fall back to J2 (CHIMERE_MULTISLOT_NATIVE unset).

Trade-off

~5-10% aggregate throughput loss vs packed multi-seq decode, in exchange for 20× crash-rate reduction.

Companion

ik_llama.cpp branch fix/multi-seq-quantize-ne2 (separate PR upstream needed) corrects 3 latent OOB write/read sites in the quantize_mmq_q8_1_cuda dispatch — different bugs caught during the same investigation (regressions from PR #1373 DRY refactor). Those fixes alone don't resolve the M>=3 race.

Companion docs

See memory note chimere-m4-prompt-diversity-crash-2026-04-25.md for full investigation timeline and bisection methodology.

🤖 Generated with Claude Code

…itigation

Adds an env-gated fallback in NativeScheduler::tick_generate_all that fans
the multi-seq decode batch into N sequential mono-seq forward_multi_seq
calls (one per Generating slot), mirroring the J2 closure scheduler.

Empirical investigation 2026-04-26 demonstrated:
  - Native M=4 + heterogeneous prompts → CUDA illegal memory access in
    `quantize_mmq_q8_1<D4>` (template MoE expert routing path) within 3-9 s
  - Native M=2 + heterogeneous prompts: STABLE (30/30 OK)
  - Native M=3+ + heterogeneous prompts: CRASH within ~8 s
  - J2 closure (mono-seq, sequential) M=4: STABLE (50/50 OK)
  - CHIMERE_SEQUENTIAL_DECODE=1 (this patch): crash time T+5 s → T+127 s,
    a 20× reduction in failure rate (~95% mitigation).

The remaining 5% race likely lives in slot lifecycle transitions
(prefill→decode, slot reuse) and needs Nsight Systems trace to pinpoint.

Wiring:
  - tick_generate_all (slot_scheduler.rs:1900) gates on env at first call
  - Sequential path: for each Generating slot, build a 1-entry batch,
    invoke forward_multi_seq, sample, emit, repeat
  - Default path (env unset): unchanged multi-seq packed batch

Trade-off: ~5–10% aggregate throughput loss vs packed multi-seq decode,
in exchange for crash-rate reduction ≈20×. Recommended setting until
ik_llama llama_decode multi-seq race is fully patched.

For complete stability prefer J2 (CHIMERE_MULTISLOT_NATIVE unset).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AIdevsmartdata
Owner Author

Status update 2026-04-26 — superseded by chimere-server NativeScheduler M=4 + FA-off workaround.

The crash root cause has been traced to a Flash Attention bug in ik_llama.cpp/ggml/src/ggml-cuda/fattn-common.cuh:783-808: launch_fattn dequantizes K/V via a 1D-linear to_fp16(K, K_f16, 1, ggml_nelements(K), stream), ignoring the multi-seq view strides and reading out of bounds.
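
For intuition only, a conceptual Rust sketch of that failure mode: the `StridedView` type and the buffer layout below are invented for illustration and are not ggml's data structures; the point is that walking a strided view as a flat run of nelements values touches memory the view does not own.

```rust
/// Conceptual stand-in for a tensor view: strides are in elements.
struct StridedView<'a> {
    data: &'a [f32],     // backing buffer (may be much larger than the view)
    offset: usize,       // start of the view inside `data`
    shape: [usize; 2],   // [rows, cols] of the view
    strides: [usize; 2], // element distance between rows / columns
}

impl<'a> StridedView<'a> {
    /// Stride-aware read: stays inside the elements the view actually owns.
    fn get(&self, r: usize, c: usize) -> f32 {
        self.data[self.offset + r * self.strides[0] + c * self.strides[1]]
    }
}

fn main() {
    // A 16-element buffer holding two interleaved sequences of 2x4 rows each:
    // seq0-row0, seq1-row0, seq0-row1, seq1-row1.
    let buffer: Vec<f32> = (0..16).map(|i| i as f32).collect();
    // View of sequence 1 only: its rows sit 8 elements apart in the buffer.
    let view = StridedView { data: &buffer, offset: 4, shape: [2, 4], strides: [8, 1] };

    // Correct, stride-aware traversal: 8 elements, all belonging to seq 1.
    let strided: Vec<f32> = (0..view.shape[0])
        .flat_map(|r| (0..view.shape[1]).map(move |c| (r, c)))
        .map(|(r, c)| view.get(r, c))
        .collect();

    // 1D-linear traversal in the spirit of to_fp16(..., nelements(K)):
    // assumes the 8 view elements are contiguous starting at the offset.
    // Here it silently picks up seq-0 data; with a larger offset it would
    // index past the end of the buffer (analogous to the OOB described above).
    let nelements = view.shape.iter().product::<usize>();
    let linear: Vec<f32> = buffer[view.offset..view.offset + nelements].to_vec();

    assert_ne!(strided, linear); // the two traversals disagree
}
```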

Production currently runs NativeScheduler M=4 with CHIMERE_FLASH_ATTN=0 + CHIMERE_KV_TYPE_V=1 + CHIMERE_KV_HADAMARD=0 — 200/200 hetero stress-test passes, stable over 535 s of sustained load.

This Rust-side CHIMERE_SEQUENTIAL_DECODE=1 fallback is no longer needed. Two paths forward:

  1. Option A (in progress) — stride-aware dequant patch in launch_fattn to bypass the 1D-linear bug → restore FA-on multi-seq.
  2. Option B (WIP) — cuDNN frontend SDPA backend on aidev/ik_llama.cpp:feat/cudnn-fa-backend-wip for Blackwell sm_120.

Recommendation: keep this PR open as a documented safety-net path until either Option A or B lands and is benchmarked stable.

Full investigation recap: ~/Bureau/chimere-debug-2026-04-26/RECAP-COMPLET-2026-04-25-26.md.

