
`--pp-partition` ignored on dual identical GPUs — always loads PP=[48] GPU-only mode, causing OOM #8

@Jatilq

Description

Hardware:

  • 2x NVIDIA GeForce RTX 3060 12GB (identical cards, indices 0 and 1)
  • 256GB DDR4 System RAM
  • Dual Xeon CPU

Krasis versions tested: 0.1.63, 0.1.64, 0.1.64-rc1, 0.1.64-rc2, 0.1.65-rc1 (current)

Model: Qwen3.5-122B-A10B (48 layers, 256 experts, 10B active params)

Config file used:

MODEL_PATH="/home/jatilq/.krasis/models/qwen122b-native"
CFG_SELECTED_GPUS="0,1"
CFG_PP_PARTITION="24,24"
CFG_LAYER_GROUP_SIZE="2"
CFG_KV_CACHE_MB="200"
CFG_KV_DTYPE="fp8_e4m3"
CFG_GPU_EXPERT_BITS="4"
CFG_ATTENTION_QUANT="awq"
CFG_SHARED_EXPERT_QUANT="int8"
CFG_DENSE_MLP_QUANT="int8"
CFG_LM_HEAD_QUANT="int8"
CFG_KRASIS_THREADS="32"
CFG_VRAM_SAFETY_MARGIN="265"

CLI flags also tried:

--num-gpus 2 --selected-gpus 0,1 --pp-partition 24,24

Expected behavior:

KrasisModel: 48 layers, PP=[24, 24], 2 GPUs

Actual behavior (every time, regardless of version or method):

KrasisModel: 48 layers, PP=[48], 1 GPUs, GPU-only mode
HCS strategy: PP=1, 2 GPUs available

Result:

CUDA out of memory. Tried to allocate 2.84 GiB. 
GPU 0 has 2.07 GiB free. GPU 1 is idle.

What was tried:

  • TUI wizard with both GPUs selected (shows 24,576MB total correctly)
  • Config file with CFG_PP_PARTITION="24,24"
  • CLI flags --pp-partition 24,24
  • All three methods produce identical PP=[48] output
  • The TUI launch screen correctly shows "PP partition: 24,24" and a budget of 11,247/12,288 MB — but the server ignores it and loads PP=[48]
  • PYTORCH_ALLOC_CONF=expandable_segments:True set in all attempts

Note: The TUI correctly detects both GPUs and shows combined 24,576MB VRAM. The budget display with PP=24,24 shows 11,247/12,288MB which is under budget. But the server always overrides to PP=[48] GPU-only regardless of what is configured. GPU 1 remains idle while GPU 0 OOMs.

Your release notes for 0.1.64 show Multi-GPU INT4 BF16 test results — what hardware was that tested on, and is there a known limitation with identical consumer cards?


(venv) jatilq@SLIM:~$ PYTORCH_ALLOC_CONF=expandable_segments:True \
    ~/.krasis/venv/bin/krasis \
    --non-interactive \
    --model-path ~/.krasis/models/qwen122b-native \
    --num-gpus 2 \
    --selected-gpus 0,1 \
    --pp-partition 24,24 \
    --kv-cache-mb 200 \
    --kv-dtype fp8_e4m3 \
    --gpu-expert-bits 4 \
    --attention-quant awq \
    --shared-expert-quant int8 \
    --lm-head-quant int8 \
    --vram-safety-margin 265 \
    --layer-group-size 2 \
    --port 8012 2>&1 | tee ~/krasis-multigpu.log
2026-03-19 18:34:49,822 krasis.server INFO Logging to /home/jatilq/krasis.log
2026-03-19 18:34:49,822 krasis.server INFO === Config file: /tmp/krasis-niq52j9u.conf ===
2026-03-19 18:34:49,822 krasis.server INFO # Krasis launch config — 2026-03-19T18:34:44.590490
2026-03-19 18:34:49,822 krasis.server INFO MODEL_PATH="/home/jatilq/.krasis/models/qwen122b-native"
2026-03-19 18:34:49,822 krasis.server INFO CFG_SELECTED_GPUS="0,1"
2026-03-19 18:34:49,822 krasis.server INFO CFG_PP_PARTITION="24,24"
2026-03-19 18:34:49,822 krasis.server INFO CFG_LAYER_GROUP_SIZE="2"
2026-03-19 18:34:49,823 krasis.server INFO CFG_KV_CACHE_MB="200"
2026-03-19 18:34:49,823 krasis.server INFO CFG_KV_DTYPE="fp8_e4m3"
2026-03-19 18:34:49,823 krasis.server INFO CFG_GPU_EXPERT_BITS="4"
2026-03-19 18:34:49,823 krasis.server INFO CFG_CPU_EXPERT_BITS="4"
2026-03-19 18:34:49,823 krasis.server INFO CFG_ATTENTION_QUANT="awq"
2026-03-19 18:34:49,823 krasis.server INFO CFG_SHARED_EXPERT_QUANT="int8"
2026-03-19 18:34:49,823 krasis.server INFO CFG_DENSE_MLP_QUANT="int8"
2026-03-19 18:34:49,823 krasis.server INFO CFG_LM_HEAD_QUANT="int8"
2026-03-19 18:34:49,823 krasis.server INFO CFG_KRASIS_THREADS="40"
2026-03-19 18:34:49,823 krasis.server INFO CFG_HOST="0.0.0.0"
2026-03-19 18:34:49,823 krasis.server INFO CFG_PORT="8012"
2026-03-19 18:34:49,823 krasis.server INFO CFG_GPU_PREFILL_THRESHOLD="300"
2026-03-19 18:34:49,823 krasis.server INFO CFG_GGUF_PATH=""
2026-03-19 18:34:49,823 krasis.server INFO CFG_VRAM_SAFETY_MARGIN="500"
2026-03-19 18:34:49,823 krasis.server INFO CFG_FORCE_LOAD=""
2026-03-19 18:34:49,823 krasis.server INFO CFG_FORCE_REBUILD_CACHE=""
2026-03-19 18:34:49,823 krasis.server INFO CFG_BUILD_CACHE=""
2026-03-19 18:34:49,823 krasis.server INFO CFG_ENABLE_THINKING="1"
2026-03-19 18:34:49,823 krasis.server INFO CFG_SESSION_ENABLED="0"
2026-03-19 18:34:49,824 krasis.server INFO CFG_NUM_GPUS="2"
2026-03-19 18:34:49,824 krasis.server INFO === Resolved arguments ===
2026-03-19 18:34:49,824 krasis.server INFO attention_quant = 'awq'
2026-03-19 18:34:49,824 krasis.server INFO benchmark = False
2026-03-19 18:34:49,824 krasis.server INFO benchmark_only = False
2026-03-19 18:34:49,824 krasis.server INFO build_cache = ''
2026-03-19 18:34:49,824 krasis.server INFO config = None
2026-03-19 18:34:49,824 krasis.server INFO cpu_expert_bits = 4
2026-03-19 18:34:49,824 krasis.server INFO dense_mlp_quant = 'int8'
2026-03-19 18:34:49,824 krasis.server INFO draft_context = 512
2026-03-19 18:34:49,824 krasis.server INFO draft_k = 3
2026-03-19 18:34:49,824 krasis.server INFO draft_model = None
2026-03-19 18:34:49,824 krasis.server INFO enable_thinking = True
2026-03-19 18:34:49,824 krasis.server INFO force_load = False
2026-03-19 18:34:49,824 krasis.server INFO force_rebuild_cache = ''
2026-03-19 18:34:49,824 krasis.server INFO gguf_path = ''
2026-03-19 18:34:49,824 krasis.server INFO gpu_decode = True
2026-03-19 18:34:49,824 krasis.server INFO gpu_expert_bits = 4
2026-03-19 18:34:49,824 krasis.server INFO gpu_prefill_threshold = 300
2026-03-19 18:34:49,824 krasis.server INFO hcs = True
2026-03-19 18:34:49,825 krasis.server INFO heatmap_path = None
2026-03-19 18:34:49,825 krasis.server INFO host = '0.0.0.0'
2026-03-19 18:34:49,825 krasis.server INFO krasis_threads = 40
2026-03-19 18:34:49,825 krasis.server INFO kv_cache_mb = 200
2026-03-19 18:34:49,825 krasis.server INFO kv_dtype = 'fp8_e4m3'
2026-03-19 18:34:49,825 krasis.server INFO layer_group_size = 2
2026-03-19 18:34:49,825 krasis.server INFO lm_head_quant = 'int8'
2026-03-19 18:34:49,825 krasis.server INFO model_path = '/home/jatilq/.krasis/models/qwen122b-native'
2026-03-19 18:34:49,825 krasis.server INFO multi_gpu_hcs = False
2026-03-19 18:34:49,825 krasis.server INFO no_stream_attention = False
2026-03-19 18:34:49,825 krasis.server INFO note = None
2026-03-19 18:34:49,825 krasis.server INFO num_gpus = 2
2026-03-19 18:34:49,825 krasis.server INFO perplexity = False
2026-03-19 18:34:49,825 krasis.server INFO port = 8012
2026-03-19 18:34:49,825 krasis.server INFO session_enabled = False
2026-03-19 18:34:49,825 krasis.server INFO shared_expert_quant = 'int8'
2026-03-19 18:34:49,825 krasis.server INFO stream_attention = False
2026-03-19 18:34:49,825 krasis.server INFO stress_test = False
2026-03-19 18:34:49,825 krasis.server INFO temperature = 0.6
2026-03-19 18:34:49,826 krasis.server INFO timing = False
2026-03-19 18:34:49,826 krasis.server INFO vram_safety_margin = 500
Loaded config from /tmp/krasis-niq52j9u.conf: {'model_path': '/home/jatilq/.krasis/models/qwen122b-native', 'num_gpus': 2, 'layer_group_size': 2, 'kv_cache_mb': 200, 'kv_dtype': 'fp8_e4m3', 'gpu_expert_bits': 4, 'cpu_expert_bits': 4, 'attention_quant': 'awq', 'shared_expert_quant': 'int8', 'dense_mlp_quant': 'int8', 'lm_head_quant': 'int8', 'krasis_threads': 40, 'host': '0.0.0.0', 'port': 8012, 'gpu_prefill_threshold': 300, 'gguf_path': '', 'vram_safety_margin': 500, 'force_load': False, 'force_rebuild_cache': '', 'build_cache': '', 'enable_thinking': True, 'session_enabled': False}
Archived previous log → logs/krasis_20260319_182755.log

▸ Krasis — qwen122b-native
2026-03-19 18:34:49,828 krasis.server INFO ── Krasis — qwen122b-native ──
Decode: GPU | HCS: on | GPUs: 2
Experts: GPU INT4 | Attention: awq | KV: fp8_e4m3
Layer groups: 2 | KV cache: 200 MB | Threads: 40
GPU-only mode: CPU expert weights and CPU decoder skipped
2026-03-19 18:34:49,828 krasis.server INFO HCS strategy: PP=1, 2 GPUs available
2026-03-19 18:34:49,830 krasis.model INFO KrasisModel: 48 layers, PP=[48], 1 GPUs, attn=flashinfer, hybrid=12 full + 36 linear

▸ Loading model weights
2026-03-19 18:34:49,831 krasis.server INFO ── Loading model weights ──
[VRAM before-load] cuda:0: alloc=0 MB, reserved=0 MB, used=118 MB, free=11788 MB, total=11907 MB
2026-03-19 18:34:50,046 krasis.model INFO VRAM_CHECKPOINT [before-load] cuda:0: alloc=0 MB, reserved=0 MB, driver_used=118 MB, free=11788 MB, total=11907 MB
[VRAM before-load] cuda:1: alloc=0 MB, reserved=0 MB, used=112 MB, free=11796 MB, total=11909 MB
2026-03-19 18:34:50,254 krasis.model INFO VRAM_CHECKPOINT [before-load] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=112 MB, free=11796 MB, total=11909 MB
2026-03-19 18:34:50,255 krasis.model INFO RAM watchdog started: will exit if < 5.0% free

▸ Loading GPU weights
2026-03-19 18:34:50,255 krasis.model INFO Phase 1: Loading GPU weights (streaming INT8)...
2026-03-19 18:34:50,257 krasis.model INFO Resident attention: all 48 layers permanently on GPU0, 2 GPUs for EP
2026-03-19 18:34:50,257 krasis.model INFO Loading full base model to cuda:0...
2026-03-19 18:34:50,257 krasis.weight_loader INFO Loading embedding: model.language_model.embed_tokens.weight
2026-03-19 18:34:50,637 krasis.weight_loader INFO Layer 0 loaded in 0.1s (GPU alloc: 1635 MB, moe=True, type=linear_attention)
2026-03-19 18:34:50,708 krasis.weight_loader INFO Layer 1 loaded in 0.1s (GPU alloc: 1818 MB, moe=True, type=linear_attention)
2026-03-19 18:34:50,772 krasis.weight_loader INFO Layer 2 loaded in 0.1s (GPU alloc: 2000 MB, moe=True, type=linear_attention)
2026-03-19 18:34:50,803 krasis.weight_loader INFO Layer 3 loaded in 0.0s (GPU alloc: 2014 MB, moe=True, type=full_attention)
2026-03-19 18:34:50,804 krasis.attention INFO Layer 3: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:50,876 krasis.weight_loader INFO Layer 4 loaded in 0.1s (GPU alloc: 2460 MB, moe=True, type=linear_attention)
2026-03-19 18:34:50,944 krasis.weight_loader INFO Layer 5 loaded in 0.1s (GPU alloc: 2643 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,012 krasis.weight_loader INFO Layer 6 loaded in 0.1s (GPU alloc: 2826 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,047 krasis.weight_loader INFO Layer 7 loaded in 0.0s (GPU alloc: 2839 MB, moe=True, type=full_attention)
2026-03-19 18:34:51,047 krasis.attention INFO Layer 7: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:51,119 krasis.weight_loader INFO Layer 8 loaded in 0.1s (GPU alloc: 3030 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,188 krasis.weight_loader INFO Layer 9 loaded in 0.1s (GPU alloc: 3213 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,255 krasis.weight_loader INFO Layer 10 loaded in 0.1s (GPU alloc: 3396 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,289 krasis.weight_loader INFO Layer 11 loaded in 0.0s (GPU alloc: 3409 MB, moe=True, type=full_attention)
2026-03-19 18:34:51,289 krasis.attention INFO Layer 11: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:51,362 krasis.weight_loader INFO Layer 12 loaded in 0.1s (GPU alloc: 3601 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,430 krasis.weight_loader INFO Layer 13 loaded in 0.1s (GPU alloc: 3783 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,499 krasis.weight_loader INFO Layer 14 loaded in 0.1s (GPU alloc: 3965 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,533 krasis.weight_loader INFO Layer 15 loaded in 0.0s (GPU alloc: 3979 MB, moe=True, type=full_attention)
2026-03-19 18:34:51,533 krasis.attention INFO Layer 15: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:51,606 krasis.weight_loader INFO Layer 16 loaded in 0.1s (GPU alloc: 4169 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,673 krasis.weight_loader INFO Layer 17 loaded in 0.1s (GPU alloc: 4352 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,741 krasis.weight_loader INFO Layer 18 loaded in 0.1s (GPU alloc: 4535 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,774 krasis.weight_loader INFO Layer 19 loaded in 0.0s (GPU alloc: 4549 MB, moe=True, type=full_attention)
2026-03-19 18:34:51,774 krasis.attention INFO Layer 19: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:51,846 krasis.weight_loader INFO Layer 20 loaded in 0.1s (GPU alloc: 4739 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,914 krasis.weight_loader INFO Layer 21 loaded in 0.1s (GPU alloc: 4923 MB, moe=True, type=linear_attention)
2026-03-19 18:34:51,981 krasis.weight_loader INFO Layer 22 loaded in 0.1s (GPU alloc: 5105 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,014 krasis.weight_loader INFO Layer 23 loaded in 0.0s (GPU alloc: 5119 MB, moe=True, type=full_attention)
2026-03-19 18:34:52,014 krasis.attention INFO Layer 23: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:52,083 krasis.weight_loader INFO Layer 24 loaded in 0.1s (GPU alloc: 5310 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,148 krasis.weight_loader INFO Layer 25 loaded in 0.1s (GPU alloc: 5492 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,219 krasis.weight_loader INFO Layer 26 loaded in 0.1s (GPU alloc: 5675 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,253 krasis.weight_loader INFO Layer 27 loaded in 0.0s (GPU alloc: 5688 MB, moe=True, type=full_attention)
2026-03-19 18:34:52,253 krasis.attention INFO Layer 27: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:52,327 krasis.weight_loader INFO Layer 28 loaded in 0.1s (GPU alloc: 5879 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,396 krasis.weight_loader INFO Layer 29 loaded in 0.1s (GPU alloc: 6061 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,465 krasis.weight_loader INFO Layer 30 loaded in 0.1s (GPU alloc: 6245 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,498 krasis.weight_loader INFO Layer 31 loaded in 0.0s (GPU alloc: 6258 MB, moe=True, type=full_attention)
2026-03-19 18:34:52,499 krasis.attention INFO Layer 31: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:52,571 krasis.weight_loader INFO Layer 32 loaded in 0.1s (GPU alloc: 6449 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,640 krasis.weight_loader INFO Layer 33 loaded in 0.1s (GPU alloc: 6632 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,708 krasis.weight_loader INFO Layer 34 loaded in 0.1s (GPU alloc: 6814 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,742 krasis.weight_loader INFO Layer 35 loaded in 0.0s (GPU alloc: 6828 MB, moe=True, type=full_attention)
2026-03-19 18:34:52,742 krasis.attention INFO Layer 35: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:52,815 krasis.weight_loader INFO Layer 36 loaded in 0.1s (GPU alloc: 7019 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,883 krasis.weight_loader INFO Layer 37 loaded in 0.1s (GPU alloc: 7202 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,952 krasis.weight_loader INFO Layer 38 loaded in 0.1s (GPU alloc: 7383 MB, moe=True, type=linear_attention)
2026-03-19 18:34:52,984 krasis.weight_loader INFO Layer 39 loaded in 0.0s (GPU alloc: 7398 MB, moe=True, type=full_attention)
2026-03-19 18:34:52,985 krasis.attention INFO Layer 39: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:53,057 krasis.weight_loader INFO Layer 40 loaded in 0.1s (GPU alloc: 7588 MB, moe=True, type=linear_attention)
2026-03-19 18:34:53,125 krasis.weight_loader INFO Layer 41 loaded in 0.1s (GPU alloc: 7770 MB, moe=True, type=linear_attention)
2026-03-19 18:34:53,193 krasis.weight_loader INFO Layer 42 loaded in 0.1s (GPU alloc: 7953 MB, moe=True, type=linear_attention)
2026-03-19 18:34:53,226 krasis.weight_loader INFO Layer 43 loaded in 0.0s (GPU alloc: 7967 MB, moe=True, type=full_attention)
2026-03-19 18:34:53,226 krasis.attention INFO Layer 43: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:53,301 krasis.weight_loader INFO Layer 44 loaded in 0.1s (GPU alloc: 8158 MB, moe=True, type=linear_attention)
2026-03-19 18:34:53,369 krasis.weight_loader INFO Layer 45 loaded in 0.1s (GPU alloc: 8340 MB, moe=True, type=linear_attention)
2026-03-19 18:34:53,438 krasis.weight_loader INFO Layer 46 loaded in 0.1s (GPU alloc: 8524 MB, moe=True, type=linear_attention)
2026-03-19 18:34:53,471 krasis.weight_loader INFO Layer 47 loaded in 0.0s (GPU alloc: 8537 MB, moe=True, type=full_attention)
2026-03-19 18:34:53,471 krasis.attention INFO Layer 47: gated attention detected (q_proj dim=16384, expected=8192)
2026-03-19 18:34:53,477 krasis.weight_loader INFO Loading final norm: model.language_model.norm.weight
2026-03-19 18:34:53,478 krasis.weight_loader INFO Loading LM head: lm_head.weight (precision=int8)
2026-03-19 18:34:55,923 krasis.model INFO VRAM[after-layers-loaded] GPU0: free=2338 MB, alloc=9275 MB, reserved=9414 MB, non-pytorch=154 MB
2026-03-19 18:34:55,924 krasis.model INFO VRAM[after-layers-loaded] GPU1: free=11796 MB, alloc=0 MB, reserved=0 MB, non-pytorch=112 MB
2026-03-19 18:34:55,924 krasis.model INFO GPU0: 9276 MB allocated
2026-03-19 18:34:55,925 krasis.model INFO GPU weights loaded in 6.1s
[VRAM after-phase1-gpu-weights] cuda:0: alloc=9275 MB, reserved=9414 MB, used=9568 MB, free=2338 MB, total=11907 MB
2026-03-19 18:34:55,926 krasis.model INFO VRAM_CHECKPOINT [after-phase1-gpu-weights] cuda:0: alloc=9275 MB, reserved=9414 MB, driver_used=9568 MB, free=2338 MB, total=11907 MB
[VRAM after-phase1-gpu-weights] cuda:1: alloc=0 MB, reserved=0 MB, used=112 MB, free=11796 MB, total=11909 MB
2026-03-19 18:34:55,926 krasis.model INFO VRAM_CHECKPOINT [after-phase1-gpu-weights] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=112 MB, free=11796 MB, total=11909 MB
2026-03-19 18:34:55,926 krasis.model INFO Attention on CPU (AWQ: will quantize and upload in decode setup), GPU free: 2338 MB
Attention on CPU (AWQ pending), 2338 MB free

▸ Loading GPU expert weights from cache
2026-03-19 18:34:55,927 krasis.model INFO Phase 2: Loading GPU expert weights (INT4)...
2026-03-19 18:34:55,927 krasis.model INFO shared_expert_gate detected — Rust engine will skip shared experts (handled on GPU)
Loading GPU Marlin cache: 48 layers...
GPU Marlin cache loaded: 60.0 GB in 49s
2026-03-19 18:35:44,605 krasis.model INFO Krasis engine: 48 MoE layers, 256 experts, hidden=3072
2026-03-19 18:35:44,703 krasis.model INFO Routing weights sent to Rust engine (48 MoE layers)
2026-03-19 18:35:44,704 krasis.model INFO Expert weights loaded in 48.8s
Expert weights loaded in 49s.
[VRAM after-phase2-expert-weights] cuda:0: alloc=9275 MB, reserved=9414 MB, used=9568 MB, free=2338 MB, total=11907 MB
2026-03-19 18:35:44,704 krasis.model INFO VRAM_CHECKPOINT [after-phase2-expert-weights] cuda:0: alloc=9275 MB, reserved=9414 MB, driver_used=9568 MB, free=2338 MB, total=11907 MB
[VRAM after-phase2-expert-weights] cuda:1: alloc=0 MB, reserved=0 MB, used=112 MB, free=11796 MB, total=11909 MB
2026-03-19 18:35:44,705 krasis.model INFO VRAM_CHECKPOINT [after-phase2-expert-weights] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=112 MB, free=11796 MB, total=11909 MB

▸ Initializing GPU prefill managers
2026-03-19 18:35:44,705 krasis.gpu_prefill INFO GpuPrefillManager(rank 0/1): expert_slice=[0, 256), local_count=256
2026-03-19 18:35:44,705 krasis.gpu_prefill INFO GPU prefill group_size=128
2026-03-19 18:35:44,705 krasis.gpu_prefill INFO Auto chunk_size: 100 experts (4.9 MB each, 489.5 MB budget of 2452.5 MB free, 491.5 MB reserved for intermediates)
2026-03-19 18:35:44,706 krasis.gpu_prefill INFO Layer-grouped mode: 2 layers/group, 24 groups, ~2491.4 MB per group, 48 total MoE layers
2026-03-19 18:35:44,706 krasis.gpu_prefill INFO GpuPrefillManager(engine): experts=256, hidden=3072, intermediate=1024, chunk_size=100, num_chunks=3, shared=1, scale=1.000, num_bits=4, prefill_mode=layer_grouped, layer_group_size=2
2026-03-19 18:35:44,706 krasis.model INFO GPU prefill manager created for cuda:0 (rank 0/1)
2026-03-19 18:35:44,706 krasis.gpu_prefill INFO Engine path: Marlin-native DMA copy (zero conversion, zero RAM cache)
2026-03-19 18:35:44,706 krasis.model INFO Building prefill pinned buffers for cuda:0...
2026-03-19 18:35:44,706 krasis.gpu_prefill INFO Pre-allocating double-buffered pinned DMA buffers: 2x 1245.7 MB (w13p=805.3, w13s=25.2, w2p=402.7, w2s=12.6)
2026-03-19 18:35:46,773 krasis.gpu_prefill INFO Double-buffered pinned DMA allocated: 2x 1245.7 MB in 2.1s
2026-03-19 18:35:46,773 krasis.gpu_prefill INFO Prefill direct views: 10/48 layers (12.5 GB)
2026-03-19 18:35:46,774 krasis.gpu_prefill INFO Prefill direct views: 20/48 layers (24.9 GB)
2026-03-19 18:35:46,774 krasis.gpu_prefill INFO Prefill direct views: 30/48 layers (37.4 GB)
2026-03-19 18:35:46,774 krasis.gpu_prefill INFO Prefill direct views: 40/48 layers (49.8 GB)
2026-03-19 18:35:46,774 krasis.gpu_prefill INFO Prefill direct views: 48/48 layers (59.8 GB)
2026-03-19 18:35:46,774 krasis.gpu_prefill INFO Prefill direct views built: 48 layers, 59.8 GB (zero-copy, no extra RAM), 0.001s
2026-03-19 18:35:46,774 krasis.model INFO GPU prefill: 1 managers, threshold=1 tokens
[VRAM after-phase3-prefill-managers] cuda:0: alloc=9275 MB, reserved=9414 MB, used=9574 MB, free=2332 MB, total=11907 MB
2026-03-19 18:35:46,775 krasis.model INFO VRAM_CHECKPOINT [after-phase3-prefill-managers] cuda:0: alloc=9275 MB, reserved=9414 MB, driver_used=9574 MB, free=2332 MB, total=11907 MB
[VRAM after-phase3-prefill-managers] cuda:1: alloc=0 MB, reserved=0 MB, used=118 MB, free=11790 MB, total=11909 MB
2026-03-19 18:35:46,775 krasis.model INFO VRAM_CHECKPOINT [after-phase3-prefill-managers] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=118 MB, free=11790 MB, total=11909 MB
2026-03-19 18:35:46,776 krasis.kv_cache INFO KV cache: 200 MB → 1066 pages (17.1K tokens)
2026-03-19 18:35:46,781 krasis.kv_cache INFO KV cache allocated: 12 layers × 1066 pages × 16 tokens = 200 MB (gqa, gqa-split)
2026-03-19 18:35:46,781 krasis.model INFO Hybrid model: 12 full attention layers (KV cache), 36 linear attention layers
[VRAM after-kv-cache-init] cuda:0: alloc=9475 MB, reserved=9614 MB, used=9774 MB, free=2132 MB, total=11907 MB
2026-03-19 18:35:46,782 krasis.model INFO VRAM_CHECKPOINT [after-kv-cache-init] cuda:0: alloc=9475 MB, reserved=9614 MB, driver_used=9774 MB, free=2132 MB, total=11907 MB
[VRAM after-kv-cache-init] cuda:1: alloc=0 MB, reserved=0 MB, used=118 MB, free=11790 MB, total=11909 MB
2026-03-19 18:35:46,783 krasis.model INFO VRAM_CHECKPOINT [after-kv-cache-init] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=118 MB, free=11790 MB, total=11909 MB
2026-03-19 18:35:47,471 krasis.tokenizer INFO Tokenizer loaded: vocab=248044, eos=248046, bos=-1
[VRAM after-full-load] cuda:0: alloc=9475 MB, reserved=9614 MB, used=9774 MB, free=2132 MB, total=11907 MB
2026-03-19 18:35:47,472 krasis.model INFO VRAM_CHECKPOINT [after-full-load] cuda:0: alloc=9475 MB, reserved=9614 MB, driver_used=9774 MB, free=2132 MB, total=11907 MB
[VRAM after-full-load] cuda:1: alloc=0 MB, reserved=0 MB, used=118 MB, free=11790 MB, total=11909 MB
2026-03-19 18:35:47,472 krasis.model INFO VRAM_CHECKPOINT [after-full-load] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=118 MB, free=11790 MB, total=11909 MB
2026-03-19 18:35:47,472 krasis.model INFO Model fully loaded in 57.6s

▸ CUDA runtime warmup
2026-03-19 18:35:47,472 krasis.server INFO ── CUDA runtime warmup ──
2026-03-19 18:35:47,475 krasis.model INFO Warming up CUDA runtime on all devices: ['cuda:0', 'cuda:1']
2026-03-19 18:35:52,185 krasis.server ERROR [stderr] :1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2026-03-19 18:35:52,186 krasis.server ERROR [stderr] :1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
2026-03-19 18:35:53,592 krasis.linear_attention INFO Compiled linear attention chunk step (default/Inductor mode)
2026-03-19 18:35:54,132 krasis.linear_attention INFO Linear attention torch.compile warmup complete on cuda:0
2026-03-19 18:35:54,542 krasis.model INFO CUDA runtime warmup on cuda:0 complete: -40 MB consumed (2236 MB free before → 2276 MB free after)
2026-03-19 18:35:54,542 krasis.model INFO CUDA runtime warmup on cuda:1 complete: 40 MB consumed (12364 MB free before → 12324 MB free after)
cuBLAS + Triton kernel compilation done

▸ Setting up GPU decode store
2026-03-19 18:35:54,543 krasis.server INFO ── Setting up GPU decode store ──
2026-03-19 18:35:54,679 krasis.server CRITICAL Uncaught exception
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/jatilq/.krasis/venv/lib/python3.12/site-packages/krasis/server.py", line 1856, in
main()
File "/home/jatilq/.krasis/venv/lib/python3.12/site-packages/krasis/server.py", line 965, in main
gpu_store = _model.setup_gpu_decode_store()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jatilq/.krasis/venv/lib/python3.12/site-packages/krasis/model.py", line 3921, in setup_gpu_decode_store
lm_head_bf16 = (w_int8.float() * scale.unsqueeze(1)).to(torch.bfloat16).contiguous()
^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.84 GiB. GPU 0 has a total capacity of 11.63 GiB of which 2.07 GiB is free. Including non-PyTorch memory, this process has 9.54 GiB memory in use. Of the allocated memory 9.26 GiB is allocated by PyTorch, and 82.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(venv) jatilq@SLIM:~$
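For what it's worth, the 2.84 GiB allocation in the traceback looks like exactly a full float32 copy of the LM head: the failing line dequantizes `w_int8` via `.float()` before the bfloat16 cast, and the load log reports vocab=248044 and hidden=3072. A quick sanity check (my own arithmetic, not Krasis code):

```python
# The failing line materializes the INT8 LM head as float32 first,
# i.e. 4 bytes per element of a (vocab, hidden) matrix, all on GPU 0.
vocab, hidden = 248_044, 3_072      # values from the tokenizer/engine log lines
fp32_bytes = vocab * hidden * 4     # w_int8.float() allocates 4 bytes/element
gib = fp32_bytes / 2**30
print(f"{gib:.2f} GiB")             # 2.84 GiB — matches the OOM message
```

So even with a working 24,24 split, GPU 0 would still need ~2.8 GiB of headroom for this one-shot dequantization unless the LM head conversion is chunked or moved.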
