Conversation

@leejet
Owner

@leejet leejet commented Nov 11, 2025

.\bin\Release\sd.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\Qwen_Image-Q4_K_S.gguf --vae ..\..\ComfyUI\models\vae\qwen_image_vae.safetensors  --qwen2vl ..\..\ComfyUI\models\text_encoders\Qwen2.5-VL-7B-Instruct-Q8_0.gguf -p 'high quality raw candid photo. eastern european young woman, late teens, slim. detailed skin texture, detailed blue eyes. pinterest style. natural late afternoon sunlight, on an outdoor rooftop bar patio, thick eyebrows, long wavy dark brown hair, wearing a pink pvc crop top, a black skirt with a white floral pattern, visible makeup, black sunglasses on her head, a silver moon pendant on a black cord, a silver chain-link bracelet, sitting at a wooden table and holding cup of coffee in her right hand, her body angled towards the camera while her head is turned to her right, with a thinking expression with left hand near chin, gazing off-camera towards the city view in the background where a bar area is also visible, neon sign "DRUNKARD" on bar with retrowave font. long pink nailpolish<lora:qwen_image/Samsung:1>' --cfg-scale 3.5 --steps 20 --sampling-method euler -v --offload-to-cpu -H 1024 -W 1024 --diffusion-fa --flow-shift 3 --seed 244727409499015 --lora-model-dir ..\..\ComfyUI\models\loras
(output image)

@leejet
Owner Author

leejet commented Nov 11, 2025

Applying LoRA at runtime will slow down inference. This is a known issue, and I’ll optimize it later when time permits.

@Green-Sky
Contributor

Nice, also thank you for making it optional. 🚀

@stduhpf
Contributor

stduhpf commented Nov 11, 2025

Nice! I guess the next step for optimization would be to do x * W + (x * lora_down) * lora_up instead of x * (W + lora_down * lora_up)?
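For reference, the two application orders are algebraically identical; the factored form just avoids materializing the dense `d_in x d_out` weight delta. A minimal numpy sketch (illustrative only, not the sd.cpp implementation; all dimensions are made up) checking the equivalence:

```python
# Sketch: applying a LoRA by merging vs. by factoring gives the same result,
# but the factored form never builds the full d_in x d_out update matrix.
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_out, rank = 64, 512, 512, 16  # hypothetical sizes

x = rng.standard_normal((tokens, d_in))
W = rng.standard_normal((d_in, d_out))
lora_down = rng.standard_normal((d_in, rank))
lora_up = rng.standard_normal((rank, d_out))

merged = x @ (W + lora_down @ lora_up)        # x * (W + lora_down * lora_up)
factored = x @ W + (x @ lora_down) @ lora_up  # x * W + (x * lora_down) * lora_up

assert np.allclose(merged, factored)
print("max abs diff:", np.abs(merged - factored).max())
```

Since `rank` is small, the extra `(x @ lora_down) @ lora_up` path is cheap, and the base weights `W` are never rewritten, which also matters when `W` is stored quantized.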

@leejet
Owner Author

leejet commented Nov 12, 2025

Yes, this should optimize the inference speed.

@wbruna
Contributor

wbruna commented Nov 12, 2025

The at_runtime mode uses slightly more memory

How much more memory is expected?

Testing on Vulkan with SDXL (--diffusion-fa, 512x768, common 16-bit model file), I usually see ~6.6G VRAM at the beginning, then it drops to ~5.2 at inference. With --lora-apply-mode at_runtime, it goes to 9G, and stays there during inference. Inference time is more or less doubled.

The resulting images are byte-identical to the ones with --lora-apply-mode immediately. But I noticed this on the logs:

[DEBUG] model.cpp:1319 - loading tensors from ./DPM_4STEPS_A1.safetensors
  |==================================================| 2364/2364 - 11820.00it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.20s (process: 0.00s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[ERROR] ggml_extend.hpp:1803 - lora alloc params backend buffer failed, num_tensors = 0
[DEBUG] model.cpp:1297 - using 4 threads for model loading
[DEBUG] model.cpp:1319 - loading tensors from ./DPM_4STEPS_A1.safetensors
  |==================================================| 2364/2364 - 11761.19it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.21s (process: 0.01s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)

@stduhpf
Contributor

stduhpf commented Nov 12, 2025

Yes, this should optimize the inference speed.

It sure does:
x * (W + Σ(lora_down * lora_up * multiplier)) + bias:

[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1628 - qwen_image compute buffer size: 281.46 MB(VRAM)
  |==================================================| 24/24 - 4.51s/it
[INFO ] stable-diffusion.cpp:2843 - sampling completed, taking 108.32s
[INFO ] stable-diffusion.cpp:2851 - generating 1 latent images completed, taking 108.58s

x * W + bias + Σ((x * lora_down) * multiplier * lora_up) (Linear blocks only):

[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1628 - qwen_image compute buffer size: 221.46 MB(VRAM)
  |==================================================| 24/24 - 2.85s/it
[INFO ] stable-diffusion.cpp:2843 - sampling completed, taking 68.54s
[INFO ] stable-diffusion.cpp:2851 - generating 1 latent images completed, taking 68.86s

(edit: for sd1.5, it looks like it's ever so slightly slower, but it saves some memory)

@Green-Sky
Contributor

The at_runtime mode uses slightly more memory

How much more memory is expected?

This also very much depends on the LoRA itself; they come in wildly different sizes.
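A rough back-of-envelope for the resident overhead (all numbers below are hypothetical, chosen only to show the arithmetic): a LoRA kept in memory at runtime costs roughly the size of its `down`/`up` factors across all adapted layers.

```python
# Sketch: approximate extra bytes for keeping a LoRA resident at runtime.
# Real LoRAs adapt several projections per block with varying ranks, which
# is why file sizes (and the memory overhead) vary so widely.
def lora_param_bytes(layers: int, d_in: int, d_out: int, rank: int,
                     bytes_per_param: int = 2) -> int:
    # each adapted layer stores down (d_in x rank) and up (rank x d_out)
    per_layer = rank * (d_in + d_out)
    return layers * per_layer * bytes_per_param

# e.g. 60 blocks, 3072-wide projections, rank 64, 16-bit weights:
mb = lora_param_bytes(layers=60, d_in=3072, d_out=3072, rank=64) / 1024**2
print(f"~{mb:.0f} MB")  # → ~45 MB
```

A rank-256 LoRA over the same layers would be four times that, so the spread between "tiny" and "heavy" LoRAs is easily an order of magnitude.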

@wbruna
Contributor

wbruna commented Nov 12, 2025

With Qwen on Vulkan, I get an assertion failure at ggml_vk_mul_mat_q_f16:

ggml/src/ggml-vulkan/ggml-vulkan.cpp:6163: GGML_ASSERT(y_non_contig || !qy_needs_dequant) failed

@leejet
Owner Author

leejet commented Nov 12, 2025

With Qwen on Vulkan, I get an assertion failure at ggml_vk_mul_mat_q_f16:

ggml/src/ggml-vulkan/ggml-vulkan.cpp:6163: GGML_ASSERT(y_non_contig || !qy_needs_dequant) failed

@wbruna Could you paste the full command line?

@leejet
Owner Author

leejet commented Nov 12, 2025

Currently, applying LoRA at runtime should not consume additional compute buffers — at least in most cases.

.\bin\Release\sd.exe -m ..\..\stable-diffusion-webui\models\Stable-diffusion\sd_xl_base_1.0.safetensors --vae ..\..\stable-diffusion-webui\models\VAE\sdxl_vae-fp16-fix.safetensors --lora-model-dir ..\..\stable-diffusion-webui\models\Lora\ -p "a lovely cat<lora:lcm-lora-xl:1>" -v   -H 1024 -W 1024 --cfg-scale 1 --steps 4

immediately

[DEBUG] ggml_extend.hpp:1656 - unet compute buffer size: 830.86 MB(VRAM)
  |==================================================| 4/4 - 3.86it/s
[INFO ] stable-diffusion.cpp:2843 - sampling completed, taking 1.11s

at_runtime

[DEBUG] ggml_extend.hpp:1656 - unet compute buffer size: 830.86 MB(VRAM)
  |==================================================| 4/4 - 2.86it/s
[INFO ] stable-diffusion.cpp:2843 - sampling completed, taking 1.47s

@wbruna
Contributor

wbruna commented Nov 12, 2025

With Qwen on Vulkan, I get an assertion failure at ggml_vk_mul_mat_q_f16:

ggml/src/ggml-vulkan/ggml-vulkan.cpp:6163: GGML_ASSERT(y_non_contig || !qy_needs_dequant) failed

@wbruna Could you paste the full command line?

./sd --offload-to-cpu --diffusion-model qwen-image-Q4_0.gguf --vae Qwen_Image-VAE.safetensors --qwen2vl Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf --lora-model-dir . -p "photo of a tree<lora:Qwen-Image-Lightning-4steps-V1.0-bf16:1>" --cfg-scale 1 --steps 10

full output
./sd --offload-to-cpu --diffusion-model qwen-image-Q4_0.gguf --vae Qwen_Image-VAE.safetensors --qwen2vl Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf --lora-model-dir . -p photo of a tree<lora:Qwen-Image-Lightning-4steps-V1.0-bf16:1> --cfg-scale 1 --steps 10
[INFO ] stable-diffusion.cpp:219  - loading diffusion model from 'qwen-image-Q4_0.gguf'
[INFO ] model.cpp:376  - load qwen-image-Q4_0.gguf using gguf format
[INFO ] stable-diffusion.cpp:266  - loading qwen2vl from 'Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf'
[INFO ] model.cpp:376  - load Qwen2.5-VL-7B-Instruct-IQ4_XS.gguf using gguf format
[INFO ] stable-diffusion.cpp:280  - loading vae from 'Qwen_Image-VAE.safetensors'
[INFO ] model.cpp:379  - load Qwen_Image-VAE.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:303  - Version: Qwen Image 
[INFO ] stable-diffusion.cpp:330  - Weight type stat:                      f32: 1422 |    q4_0: 720  |    q4_1: 120  |    q4_K: 1    |    q5_K: 28   |  iq4_xs: 168  |    bf16: 6    
[INFO ] stable-diffusion.cpp:331  - Conditioner weight type stat:          f32: 141  |    q4_K: 1    |    q5_K: 28   |  iq4_xs: 168  
[INFO ] stable-diffusion.cpp:332  - Diffusion model weight type stat:      f32: 1087 |    q4_0: 720  |    q4_1: 120  |    bf16: 6    
[INFO ] stable-diffusion.cpp:333  - VAE weight type stat:                  f32: 194  
[INFO ] qwen_image.hpp:527  - qwen_image_params.num_layers: 60
  |=======================================>          | 1933/2465 - 69.00it/s
  |==============================================>   | 2271/2465 - 39.35it/s
  |==================================================| 2465/2465 - 41.98it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 58.72s (process: 0.00s, read: 56.12s, memcpy: 0.00s, convert: 2.06s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:720  - total params memory size = 16837.28MB (VRAM 16837.28MB, RAM 0.00MB): text_encoders 5393.90MB(VRAM), diffusion_model 11303.55MB(VRAM), vae 139.84MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:821  - running in FLOW mode
[INFO ] stable-diffusion.cpp:3048 - TXT2IMG
[INFO ] stable-diffusion.cpp:1149 - apply at runtime
[INFO ] model.cpp:379  - load ./Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors using safetensors format
[INFO ] lora.hpp:40   - loading LoRA from './Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors'
  |==================================================| 2160/2160 - 10800.00it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.20s (process: 0.00s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[ERROR] ggml_extend.hpp:1803 - lora alloc params backend buffer failed, num_tensors = 0
  |==================================================| 2160/2160 - 10800.00it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.20s (process: 0.00s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[INFO ] model.cpp:379  - load ./Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors using safetensors format
[INFO ] lora.hpp:40   - loading LoRA from './Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors'
  |==================================================| 2160/2160 - 10800.00it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.20s (process: 0.00s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
  |==================================================| 2160/2160 - 674.58it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 3.21s (process: 0.00s, read: 2.52s, memcpy: 0.00s, convert: 0.12s, copy_to_backend: 0.43s)
[INFO ] model.cpp:379  - load ./Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors using safetensors format
[INFO ] lora.hpp:40   - loading LoRA from './Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors'
  |==================================================| 2160/2160 - 10800.00it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.20s (process: 0.00s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[ERROR] ggml_extend.hpp:1803 - lora alloc params backend buffer failed, num_tensors = 0
  |==================================================| 2160/2160 - 10746.27it/s
[INFO ] model.cpp:1522 - loading tensors completed, taking 0.21s (process: 0.01s, read: 0.00s, memcpy: 0.00s, convert: 0.00s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:1154 - apply_loras completed, taking 4.42s
[INFO ] ggml_extend.hpp:1722 - qwenvl2.5 offload params (5393.90 MB, 338 tensors) to runtime backend (Vulkan1), taking 6.68s
[INFO ] stable-diffusion.cpp:2694 - get_learned_condition completed, taking 11320 ms
[INFO ] stable-diffusion.cpp:2712 - sampling using Euler method
[INFO ] stable-diffusion.cpp:2806 - generating image: 1/1 - seed 42
[INFO ] ggml_extend.hpp:1722 - qwen_image offload params (11303.54 MB, 1933 tensors) to runtime backend (Vulkan1), taking 30.60s
ggml/src/ggml-vulkan/ggml-vulkan.cpp:6163: GGML_ASSERT(y_non_contig || !qy_needs_dequant) failed
[New LWP 1512828]
[New LWP 1512827]
[New LWP 1512826]
[New LWP 1512825]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: Arquivo ou diretório inexistente
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007fe822ba9668 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: Arquivo ou diretório inexistente
#2  0x00007fe822ba96ad in __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007fe822c14787 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: Arquivo ou diretório inexistente
#4  0x000055bbf04f7feb in ggml_print_backtrace ()
#5  0x000055bbf04f8138 in ggml_abort ()
#6  0x000055bbf0423e1f in ggml_vk_mul_mat_q_f16(ggml_backend_vk_context*, std::shared_ptr<vk_context_struct>&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, bool, bool) ()
#7  0x000055bbf042a3da in ggml_vk_mul_mat(ggml_backend_vk_context*, std::shared_ptr<vk_context_struct>&, ggml_tensor*, ggml_tensor*, ggml_tensor*, bool) ()
#8  0x000055bbf04cead5 in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool, bool) ()
#9  0x000055bbf04d047b in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) ()
#10 0x000055bbf05100f1 in ggml_backend_graph_compute ()
#11 0x000055bbf028a471 in GGMLRunner::compute(std::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) ()
#12 0x000055bbf028b081 in QwenImageModel::compute(int, DiffusionParams, ggml_tensor**, ggml_context*) ()
#13 0x000055bbf028ee3f in StableDiffusionGGML::sample(ggml_context*, std::shared_ptr<DiffusionModel>, bool, ggml_tensor*, ggml_tensor*, SDCondition, SDCondition, SDCondition, ggml_tensor*, float, sd_guidance_params_t, float, int, sample_method_t, std::vector<float, std::allocator<float> > const&, int, SDCondition, std::vector<ggml_tensor*, std::allocator<ggml_tensor*> >, bool, ggml_tensor*, ggml_tensor*, float)::{lambda(ggml_tensor*, float, int)#1}::operator()(ggml_tensor*, float, int) const ()
#14 0x000055bbf029932c in StableDiffusionGGML::sample(ggml_context*, std::shared_ptr<DiffusionModel>, bool, ggml_tensor*, ggml_tensor*, SDCondition, SDCondition, SDCondition, ggml_tensor*, float, sd_guidance_params_t, float, int, sample_method_t, std::vector<float, std::allocator<float> > const&, int, SDCondition, std::vector<ggml_tensor*, std::allocator<ggml_tensor*> >, bool, ggml_tensor*, ggml_tensor*, float) ()
#15 0x000055bbf0275337 in generate_image_internal(sd_ctx_t*, ggml_context*, ggml_tensor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, sd_guidance_params_t, float, int, int, int, sample_method_t, std::vector<float, std::allocator<float> > const&, long, int, sd_image_t, float, sd_pm_params_t, std::vector<sd_image_t*, std::allocator<sd_image_t*> >, std::vector<ggml_tensor*, std::allocator<ggml_tensor*> >, bool, ggml_tensor*, ggml_tensor*) ()
#16 0x000055bbf02784cf in generate_image ()
#17 0x000055bbf01c1d2d in main ()
[Inferior 1 (process 1512824) detached]
Aborted (core dumped)

(this was with 9a35003 , I'll test again with the most recent commit)

EDIT: working now on 8850157 🙂

@leejet leejet merged commit 347710f into master Nov 13, 2025
9 checks passed
@leejet leejet deleted the lora_improve branch November 16, 2025 09:37