Conversation

@leejet
Owner

@leejet leejet commented Nov 11, 2025

.\bin\Release\sd.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Qwen_Image-Q4_K_S.gguf --vae ..\..\ComfyUI\models\vae\qwen_image_vae.safetensors  --qwen2vl ..\..\ComfyUI\models\text_encoders\Qwen2.5-VL-7B-Instruct-Q8_0.gguf -p 'high quality raw candid photo. eastern european young woman, late teens, slim. detailed skin texture, detailed blue eyes. pinterest style. natural late afternoon sunlight, on an outdoor rooftop bar patio, thick eyebrows, long wavy dark brown hair, wearing a pink pvc crop top, a black skirt with a white floral pattern, visible makeup, black sunglasses on her head, a silver moon pendant on a black cord, a silver chain-link bracelet, sitting at a wooden table and holding cup of coffee in her right hand, her body angled towards the camera while her head is turned to her right, with a thinking expression with left hand near chin, gazing off-camera towards the city view in the background where a bar area is also visible, neon sign "DRUNKARD" on bar with retrowave font. long pink nailpolish<lora:qwen_image/Samsung:1>' --cfg-scale 3.5 --steps 20 --sampling-method euler -v --offload-to-cpu -H 1024 -W 1024 --diffusion-fa --flow-shift 3 --seed 244727409499015 --lora-model-dir ..\..\ComfyUI\models\loras
(output image)

@leejet
Owner Author

leejet commented Nov 11, 2025

Applying LoRA at runtime will slow down inference. This is a known issue, and I’ll optimize it later when time permits.

@Green-Sky
Contributor

Nice, also thank you for making it optional. 🚀

@stduhpf
Contributor

stduhpf commented Nov 11, 2025

Nice! I guess the next step for optimization would be to do x * W + (x * lora_down) * lora_up instead of x * (W + lora_down * lora_up)?
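For reference, the two application orders are algebraically identical; the factored form just avoids materializing the dense `d_in x d_out` weight delta. A minimal numpy sketch (illustrative only, not the sd.cpp implementation; all dimensions are made up) checking the equivalence:

```python
# Sketch: applying a LoRA by merging vs. by factoring gives the same result,
# but the factored form never builds the full d_in x d_out update matrix.
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_out, rank = 64, 512, 512, 16  # hypothetical sizes

x = rng.standard_normal((tokens, d_in))
W = rng.standard_normal((d_in, d_out))
lora_down = rng.standard_normal((d_in, rank))
lora_up = rng.standard_normal((rank, d_out))

merged = x @ (W + lora_down @ lora_up)        # x * (W + lora_down * lora_up)
factored = x @ W + (x @ lora_down) @ lora_up  # x * W + (x * lora_down) * lora_up

assert np.allclose(merged, factored)
print("max abs diff:", np.abs(merged - factored).max())
```

Since `rank` is small, the extra `(x @ lora_down) @ lora_up` path is cheap, and the base weights `W` are never rewritten, which also matters when `W` is stored quantized.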

@leejet
Owner Author

leejet commented Nov 12, 2025

Yes, this should optimize the inference speed.

@wbruna
Contributor

wbruna commented Nov 12, 2025

The at_runtime mode uses slightly more memory

How much more memory is expected?

Testing on Vulkan with SDXL (--diffusion-fa, 512x768, common 16-bit model file), I usually see ~6.6G VRAM at the beginning, then it drops to ~5.2 at inference. With --lora-apply-mode at_runtime, it goes to 9G, and stays there during inference. Inference time is more or less doubled.

The resulting images are byte-identical to the ones with --lora-apply-mode immediately. But I noticed this on the logs:

[DEBUG] model.cpp:1319 - loading tensors from ./DPM_4STEPS_A1.safetensors
  |==================================================| 2364/2364 - 11820.00it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.20s (process: 0.00s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[ERROR] ggml_extend.hpp:1803 - lora alloc params backend buffer failed, num_tensors = 0
[DEBUG] model.cpp:1297 - using 4 threads for model loading
[DEBUG] model.cpp:1319 - loading tensors from ./DPM_4STEPS_A1.safetensors
  |==================================================| 2364/2364 - 11761.19it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.21s (process: 0.01s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)

@stduhpf
Contributor

stduhpf commented Nov 12, 2025

Yes, this should optimize the inference speed.

It sure does:
x * (W + Σ(lora_down * lora_up * multiplier)) + bias:

[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1628 - qwen_image compute buffer size: 281.46 MB(VRAM)
  |==================================================| 24/24 - 4.51s/it
[INFO ] stable-diffusion.cpp:2843 - sampling completed, taking 108.32s
[INFO ] stable-diffusion.cpp:2851 - generating 1 latent images completed, taking 108.58s

x * W + bias + Σ((x * lora_down) * multiplier * lora_up) (Linear blocks only):

[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1628 - qwen_image compute buffer size: 221.46 MB(VRAM)
  |==================================================| 24/24 - 2.85s/it
[INFO ] stable-diffusion.cpp:2843 - sampling completed, taking 68.54s
[INFO ] stable-diffusion.cpp:2851 - generating 1 latent images completed, taking 68.86s

(edit: for sd1.5, it looks like it's ever so slightly slower, but it saves some memory)

@Green-Sky
Contributor

The at_runtime mode uses slightly more memory

How much more memory is expected?

This also very much depends on the LoRA itself; they come in wildly different sizes.
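A rough back-of-envelope for the resident overhead (all numbers below are hypothetical, chosen only to show the arithmetic): a LoRA kept in memory at runtime costs roughly the size of its `down`/`up` factors across all adapted layers.

```python
# Sketch: approximate extra bytes for keeping a LoRA resident at runtime.
# Real LoRAs adapt several projections per block with varying ranks, which
# is why file sizes (and the memory overhead) vary so widely.
def lora_param_bytes(layers: int, d_in: int, d_out: int, rank: int,
                     bytes_per_param: int = 2) -> int:
    # each adapted layer stores down (d_in x rank) and up (rank x d_out)
    per_layer = rank * (d_in + d_out)
    return layers * per_layer * bytes_per_param

# e.g. 60 blocks, 3072-wide projections, rank 64, 16-bit weights:
mb = lora_param_bytes(layers=60, d_in=3072, d_out=3072, rank=64) / 1024**2
print(f"~{mb:.0f} MB")  # → ~45 MB
```

A rank-256 LoRA over the same layers would be four times that, so the spread between "tiny" and "heavy" LoRAs is easily an order of magnitude.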

@wbruna
Contributor

wbruna commented Nov 12, 2025

With Qwen on Vulkan, I get an assertion failure at ggml_vk_mul_mat_q_f16:

ggml/src/ggml-vulkan/ggml-vulkan.cpp:6163: GGML_ASSERT(y_non_contig || !qy_needs_dequant) failed

@leejet
Owner Author

leejet commented Nov 12, 2025

With Qwen on Vulkan, I get an assertion failure at ggml_vk_mul_mat_q_f16:

ggml/src/ggml-vulkan/ggml-vulkan.cpp:6163: GGML_ASSERT(y_non_contig || !qy_needs_dequant) failed

@wbruna Could you paste the full command line?

@leejet
Owner Author

leejet commented Nov 12, 2025

Currently, applying LoRA at runtime should not consume additional compute buffers — at least in most cases.

.\bin\Release\sd.exe -m ..\..\stable-diffusion-webui\models\Stable-diffusion\sd_xl_base_1.0.safetensors --vae ..\..\stable-diffusion-webui\models\VAE\sdxl_vae-fp16-fix.safetensors --lora-model-dir ..\..\stable-diffusion-webui\models\Lora\ -p "a lovely cat<lora:lcm-lora-xl:1>" -v   -H 1024 -W 1024 --cfg-scale 1 --steps 4

immediately

[DEBUG] ggml_extend.hpp:1656 - unet compute buffer size: 830.86 MB(VRAM)
  |==================================================| 4/4 - 3.86it/s
[INFO ] stable-diffusion.cpp:2843 - sampling completed, taking 1.11s

at_runtime

[DEBUG] ggml_extend.hpp:1656 - unet compute buffer size: 830.86 MB(VRAM)
  |==================================================| 4/4 - 2.86it/s
[INFO ] stable-diffusion.cpp:2843 - sampling completed, taking 1.47s

@wbruna
Contributor

wbruna commented Nov 12, 2025

With Qwen on Vulkan, I get an assertion failure at ggml_vk_mul_mat_q_f16:

ggml/src/ggml-vulkan/ggml-vulkan.cpp:6163: GGML_ASSERT(y_non_contig || !qy_needs_dequant) failed

@wbruna Could you paste the full command line?

./sd --offload-to-cpu --diffusion-model qwen-image-Q4_0.gguf --vae Qwen_Image-VAE.safetensors --qwen2vl Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf --lora-model-dir . -p "photo of a tree<lora:Qwen-Image-Lightning-4steps-V1.0-bf16:1>" --cfg-scale 1 --steps 10

full output
./sd --offload-to-cpu --diffusion-model qwen-image-Q4_0.gguf --vae Qwen_Image-VAE.safetensors --qwen2vl Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf --lora-model-dir . -p photo of a tree<lora:Qwen-Image-Lightning-4steps-V1.0-bf16:1> --cfg-scale 1 --steps 10
[INFO ] stable-diffusion.cpp:219  - loading diffusion model from 'qwen-image-Q4_0.gguf'
[INFO ] model.cpp:376  - load qwen-image-Q4_0.gguf using gguf format
[INFO ] stable-diffusion.cpp:266  - loading qwen2vl from 'Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf'
[INFO ] model.cpp:376  - load Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf using gguf format
[INFO ] stable-diffusion.cpp:280  - loading vae from 'Qwen_Image-VAE.safetensors'
[INFO ] model.cpp:379  - load Qwen_Image-VAE.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:303  - Version: Qwen Image 
[INFO ] stable-diffusion.cpp:330  - Weight type stat:                      f32: 1422 |    q4_0: 720  |    q4_1: 120  |    q4_K: 1    |    q5_K: 28   |  iq4_xs: 168  |    bf16: 6    
[INFO ] stable-diffusion.cpp:331  - Conditioner weight type stat:          f32: 141  |    q4_K: 1    |    q5_K: 28   |  iq4_xs: 168  
[INFO ] stable-diffusion.cpp:332  - Diffusion model weight type stat:      f32: 1087 |    q4_0: 720  |    q4_1: 120  |    bf16: 6    
[INFO ] stable-diffusion.cpp:333  - VAE weight type stat:                  f32: 194  
[INFO ] qwen_image.hpp:527  - qwen_image_params.num_layers: 60
  |=======================================>          | 1933/2465 - 69.00it/s
  |==============================================>   | 2271/2465 - 39.35it/s
  |==================================================| 2465/2465 - 41.98it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 58.72s (process: 0.00s, read: 56.12s, memcpy: 0.00s, convert: 2.06s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:720  - total params memory size = 16837.28MB (VRAM 16837.28MB, RAM 0.00MB): text_encoders 5393.90MB(VRAM), diffusion_model 11303.55MB(VRAM), vae 139.84MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:821  - running in FLOW mode
[INFO ] stable-diffusion.cpp:3048 - TXT2IMG
[INFO ] stable-diffusion.cpp:1149 - apply at runtime
[INFO ] model.cpp:379  - load ./Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors using safetensors format
[INFO ] lora.hpp:40   - loading LoRA from './Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors'
  |==================================================| 2160/2160 - 10800.00it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.20s (process: 0.00s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[ERROR] ggml_extend.hpp:1803 - lora alloc params backend buffer failed, num_tensors = 0
  |==================================================| 2160/2160 - 10800.00it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.20s (process: 0.00s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[INFO ] model.cpp:379  - load ./Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors using safetensors format
[INFO ] lora.hpp:40   - loading LoRA from './Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors'
  |==================================================| 2160/2160 - 10800.00it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.20s (process: 0.00s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
  |==================================================| 2160/2160 - 674.58it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 3.21s (process: 0.00s, read: 2.52s, memcpy: 0.00s, convert: 0.12s, copy_to_backend: 0.43s)
[INFO ] model.cpp:379  - load ./Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors using safetensors format
[INFO ] lora.hpp:40   - loading LoRA from './Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors'
  |==================================================| 2160/2160 - 10800.00it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.20s (process: 0.00s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[ERROR] ggml_extend.hpp:1803 - lora alloc params backend buffer failed, num_tensors = 0
  |==================================================| 2160/2160 - 10746.27it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.21s (process: 0.01s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:1154 - apply_loras completed, taking 4.42s
[INFO ] ggml_extend.hpp:1722 - qwenvl2.5 offload params (5393.90 MB, 338 tensors) to runtime backend (Vulkan1), taking 6.68s
[INFO ] stable-diffusion.cpp:2694 - get_learned_condition completed, taking 11320 ms
[INFO ] stable-diffusion.cpp:2712 - sampling using Euler method
[INFO ] stable-diffusion.cpp:2806 - generating image: 1/1 - seed 42
[INFO ] ggml_extend.hpp:1722 - qwen_image offload params (11303.54 MB, 1933 tensors) to runtime backend (Vulkan1), taking 30.60s
ggml/src/ggml-vulkan/ggml-vulkan.cpp:6163: GGML_ASSERT(y_non_contig || !qy_needs_dequant) failed
[New LWP 1512828]
[New LWP 1512827]
[New LWP 1512826]
[New LWP 1512825]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: Arquivo ou diretório inexistente
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007fe822ba9668 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: Arquivo ou diretório inexistente
#2  0x00007fe822ba96ad in __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007fe822c14787 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: Arquivo ou diretório inexistente
#4  0x000055bbf04f7feb in ggml_print_backtrace ()
#5  0x000055bbf04f8138 in ggml_abort ()
#6  0x000055bbf0423e1f in ggml_vk_mul_mat_q_f16(ggml_backend_vk_context*, std::shared_ptr<vk_context_struct>&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, bool, bool) ()
#7  0x000055bbf042a3da in ggml_vk_mul_mat(ggml_backend_vk_context*, std::shared_ptr<vk_context_struct>&, ggml_tensor*, ggml_tensor*, ggml_tensor*, bool) ()
#8  0x000055bbf04cead5 in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool, bool) ()
#9  0x000055bbf04d047b in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) ()
#10 0x000055bbf05100f1 in ggml_backend_graph_compute ()
#11 0x000055bbf028a471 in GGMLRunner::compute(std::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) ()
#12 0x000055bbf028b081 in QwenImageModel::compute(int, DiffusionParams, ggml_tensor**, ggml_context*) ()
#13 0x000055bbf028ee3f in StableDiffusionGGML::sample(ggml_context*, std::shared_ptr<DiffusionModel>, bool, ggml_tensor*, ggml_tensor*, SDCondition, SDCondition, SDCondition, ggml_tensor*, float, sd_guidance_params_t, float, int, sample_method_t, std::vector<float, std::allocator<float> > const&, int, SDCondition, std::vector<ggml_tensor*, std::allocator<ggml_tensor*> >, bool, ggml_tensor*, ggml_tensor*, float)::{lambda(ggml_tensor*, float, int)#1}::operator()(ggml_tensor*, float, int) const ()
#14 0x000055bbf029932c in StableDiffusionGGML::sample(ggml_context*, std::shared_ptr<DiffusionModel>, bool, ggml_tensor*, ggml_tensor*, SDCondition, SDCondition, SDCondition, ggml_tensor*, float, sd_guidance_params_t, float, int, sample_method_t, std::vector<float, std::allocator<float> > const&, int, SDCondition, std::vector<ggml_tensor*, std::allocator<ggml_tensor*> >, bool, ggml_tensor*, ggml_tensor*, float) ()
#15 0x000055bbf0275337 in generate_image_internal(sd_ctx_t*, ggml_context*, ggml_tensor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, sd_guidance_params_t, float, int, int, int, sample_method_t, std::vector<float, std::allocator<float> > const&, long, int, sd_image_t, float, sd_pm_params_t, std::vector<sd_image_t*, std::allocator<sd_image_t*> >, std::vector<ggml_tensor*, std::allocator<ggml_tensor*> >, bool, ggml_tensor*, ggml_tensor*) ()
#16 0x000055bbf02784cf in generate_image ()
#17 0x000055bbf01c1d2d in main ()
[Inferior 1 (process 1512824) detached]
Aborted (core dumped)

(this was with 9a35003 , I'll test again with the most recent commit)

EDIT: working now on 8850157 🙂

@leejet leejet merged commit 347710f into master Nov 13, 2025
9 checks passed
@leejet leejet deleted the lora_improve branch November 16, 2025 09:37