Add Windows Build Workflow (GitHub Actions) by Thireus · Pull Request #976 · ikawrakow/ik_llama.cpp

Thireus · 2025-11-17T15:00:23Z

First, thank you for maintaining this project — it has been very useful, and I appreciate the work that has gone into it.

I initially created a fork to add automated Windows builds for my own use, because I needed ready-to-use binaries. As this functionality could benefit other users as well, I’m submitting this pull request so the Windows build workflow can live directly in the main repository instead of in my fork.

This PR includes:

- GitHub Actions workflow for building the project on Windows
- Automatic artifact uploads so users can download ready-to-use Windows builds
- Other small tweaks to ensure the code compiles on Windows

Builds must be triggered manually (I believe they could be automated to run after each commit, but I have not had the chance to dig into it). Some cleanup also remains, specifically for the other automated jobs, which currently run meaningless checks. The workflow code was taken from mainline (llama.cpp) and tweaked to adapt it to ik_llama.cpp.
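For readers unfamiliar with GitHub Actions, the difference between a manual and a per-commit trigger can be sketched as follows. This is a minimal illustration under stated assumptions, not the workflow submitted in this PR: the file name, job name, artifact name, and output path are all hypothetical, and the real workflow may pass additional CMake options.

```yaml
# Hypothetical file: .github/workflows/build-windows.yml (name is an assumption)
name: Windows build

on:
  workflow_dispatch:        # manual trigger from the Actions tab
  # push:                   # uncomment to also build after each commit to main
  #   branches: [main]

jobs:
  build-windows:            # job name is illustrative
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure
        run: cmake -B build

      - name: Build
        run: cmake --build build --config Release

      - name: Upload binaries
        uses: actions/upload-artifact@v4
        with:
          name: ik_llama-windows-x64   # artifact name is an assumption
          path: build/bin/Release/     # output path is an assumption
```

With only `workflow_dispatch` listed, nothing runs until someone starts the workflow by hand; adding the `push` block would build on every commit to `main`, at the cost of more runner minutes.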

My goal was to make the project more accessible to Windows users, specifically users who, like me, lack the time or the knowledge to set up a development environment on Windows. If you’d prefer changes to structure, naming, workflow triggers, or anything else, I’m happy to adjust the PR accordingly.

Thanks again for the project and for taking the time to review this!


…kow#924)

* server: fix crash when prompt has image and is too long
* server: fix CORS
* server: fix empty result for embedding
* change error message to truncate prompt
* server: fix slot id for save and load state
* bug fix
* server: update slot similarity to handle mtmd
* server: quick hack to calculate number of token processed with image
* server: fix out of range error when detokenizing prompt under verbose
* Add back Access-Control-Allow-Origin
* Server: Add prompt tokens in embedding results

---------

Co-authored-by: firecoperana <firecoperana>


This commit enables IQK quantization operations on ARM-based systems, specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell.

Changes:
- Enable IQK_IMPLEMENT macro for ARM NEON operations
- Add arm_neon.h header include for ARM SIMD intrinsics
- Fix compilation errors related to missing NEON types and functions

Build requirements for ARM:

    cmake .. -DGGML_CUDA=ON \
        -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \
        -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16"

Tested on:
- Platform: NVIDIA DGX Spark (aarch64)
- CPU: GB10 Grace Blackwell Superchip
- Memory: 128GB unified memory

Fixes build errors:
- 'float32x4_t' does not name a type
- 'vld1q_f32' was not declared in this scope
- 'v_expf' was not declared in this scope
- Missing FP16 NEON intrinsics


…ine args (ikawrakow#932)

* Add command line argument for draft model
* Remove second context of draft model
* Format print
* print usage if parsing -draft fails

---------

Co-authored-by: firecoperana <firecoperana>


…gml-cuda.dll)

Attempt to address Thireus/GGUF-Tool-Suite#41

> I primarily want ik_llama backend builds split back into 3 pieces ( ggml-base.dll, ggml-cpu.dll, and ggml-cuda.dll ) so they can be more easily monkey-patched into LM Studio. Which change do I need to make in my code for this to happen. Tell me which change I need to make in which file and where in that file. I'll copy paste what you give me.

Fix numerous LNK2019 and LNK2001 "unresolved external symbol" errors
Fix unresolved external symbols
Revert "Fix unresolved external symbols" This reverts commit 20fae5b.
Revert "Fix numerous LNK2019 and LNK2001 "unresolved external symbol" errors" This reverts commit 8857398.
Revert "ggml.dll split back into 3 pieces (ggml-base.dll, ggml-cpu.dll, and ggml-cuda.dll)" This reverts commit a546f9e.
Another ggml.dll split attempt
Bugfix attempt
include the high-level ggml sources mainline keeps in ggml-base
Test something else
Another bugfix attempt
Revert "Another bugfix attempt" This reverts commit 81d30bc.
Revert "Test something else" This reverts commit fccfe19.
Revert "include the high-level ggml sources mainline keeps in ggml-base" This reverts commit 3c38b71.
Fix missing ggml/src/ggml-impl.h
Fix ggml_table_f32_f16 issue
Update ggml-backend-reg.cpp
Update ggml-backend-reg.cpp
Update ggml-backend-reg.cpp
Fix LNK1248: image size exceeds maximum allowable size
Update CMakeLists.txt
Update CMakeLists.txt
Update CMakeLists.txt
Update CMakeLists.txt


Thireus and others added 30 commits November 7, 2025 17:13

Merge branch 'ikawrakow:main' into main

ecb8c86

Disable add + fused_rms_norm fusion (ikawrakow#916)

d62e8c5

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Merge branch 'ikawrakow:main' into main

86606c2

Adopt fix from mainline PR 17089 (ikawrakow#920)

55576c9

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Merge branch 'ikawrakow:main' into main

523d182

CUDA: fuse copies to K and V cache (ikawrakow#921)

e5fc02c

* Fuse copies to K- and V-cache on CUDA
* Adapt to latest main

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Merge branch 'ikawrakow:main' into main

ffe321f

CUDA MoE improvements (ikawrakow#923)

9207a48

* Use mmq_id in mul_mat_id
* Better
* Also use it in the fused up+gate op
* Better -no-fmoe TG on CUDA

  Still much slower than -fmoe, but about 20-25% faster than what we had before.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

server: bug fix for preserved_tokens not preserved in process_token (ikawrakow#926)

ff4c1c6

Co-authored-by: firecoperana <firecoperana>

Fix compiler warning

0db683e

Merge branch 'ikawrakow:main' into main

fc87e37

Make biased gemv fusion optional (ikawrakow#931)

db3bed2

* Make biased gemv fusion optional
* Fix one path through gemv fusion
* Remove forgotten printf

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Merge branch 'ikawrakow:main' into main

5953499

Use fused gemv+add only for TG (ikawrakow#933)

ad688e1

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Merge branch 'ikawrakow:main' into main

ca115a9

DeepSeek TG optimizations for TG (ikawrakow#928)

7747000

* Fuse concat and copy into K cache
* Avoid ggml_cont() when n_token = 1

Combined effect: about +2% in TG performance with full GPU offload

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

DeepSeek FA optimizations (ikawrakow#929)

a313b71

* Use new-new-mma also for MLA=3, and use mask bounds

  This gives us ~25% better PP at 32k tokens compared to main

* This seems better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Merge branch 'ikawrakow:main' into main

cd521dc

Add support for SmolLM3 (ikawrakow#934)

e4145c0

* Convert from HF
* Model loading and compute graph

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Minor: remove unnecessary calls to build_inp_out_ids (ikawrakow#935)

489554b

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Merge branch 'ikawrakow:main' into main

d200340

Add rcache to llama-bench (ikawrakow#936)

1e6f8ff

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Add --n-cpu-moe to llama_bench (ikawrakow#937)

5e7f671

* Add --n-cpu-moe to llama_bench
* Add usage

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Opt from ikawrakow#880 also for iqk cuda gemv (ikawrakow#938)

463c694

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Enable fusion by default (ikawrakow#939)

9ecfee6

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Merge branch 'ikawrakow:main' into main

09c61e1

Set mla=3 by default (ikawrakow#943)

8a8de91

so more recent users that haven't followed the history of FlashMLA evolution and hence don't know about the MLA options get the best setting without having to add -mla 3 on the command line.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

ikawrakow and others added 18 commits January 5, 2026 08:00

Mimo-V2-Flash support (ikawrakow#1096)

8a6622e

* Mimo-2 support
* Fix bug for head sizes not being the same

  It still does not solve the Mimo-2 quantized cache issue.

* Fix quantized cache
* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Fix race in CUDA FA for head sizes 192/128 (ikawrakow#1104)

9c866df

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Graph parallel for Mimo-V2-Flash (ikawrakow#1105)

cac2b04

* WIP
* Cleanup
* Set max_gpu to 2 for Mimo2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Split mode graph for Qwen3 (ikawrakow#1106)

359cf81

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Split mode 'graph' for Qwen3-VL (ikawrakow#1107)

d923639

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Revert "Attempt to fix const char * issue for Windows builds"

f38c91e

This reverts commit caffc04.

Revert "Bugfix attempt: A single input file is required for a non-link phase..."

036a296

This reverts commit 4b22e1e.

Reapply "Bugfix attempt: A single input file is required for a non-link phase..."

35e82b9

This reverts commit 41ba6cd.

Revert "Bugfix attempt: A single input file is required for a non-link phase..."

b8a7fbb

This reverts commit d6bcc43.

Merge branch 'ikawrakow:main' into main

9f30670

Disable ring reduction for now (ikawrakow#1112)

8e9d2c3

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Fix ring reduction (ikawrakow#1114)

6bf4ffe

* Fix ring reduction
* Actually enable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Enable up to 4 GPUs for Mimo2-Flash (ikawrakow#1115)

3c91353

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Split mode "graph" for Hunyuan-MoE (ikawrakow#1116)

8e9d66c

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

CUDA: compress-mode size (ikawrakow#1110)

1b24192

Co-authored-by: firecoperana <firecoperana>

Split mode "graph" for GPT-OSS (ikawrakow#1118)

d581d75

* Split mode "graph" for GPT-OSS
* Force split_mode_f16 to false

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Do not abort on NCCL initialization failure (ikawrakow#1120)

0c2d924

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Thireus force-pushed the main branch from eead478 to 8d4c2d4 on January 8, 2026 09:45

Thireus and others added 8 commits January 8, 2026 09:46

Revert "ggml.dll split back into 3 pieces (ggml-base.dll, ggml-cpu.dll, and ggml-cuda.dll)"

79543a3

This reverts commit 8d4c2d4.

Merge branch 'ikawrakow:main' into main

8b3c052

Split mode "graph" for Ernie-4.5-MoE (ikawrakow#1121)

145e4f4

* Ernie-4.5-MoE split mode graph
* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Merge branch 'ikawrakow:main' into main

ee5993d

Fix data races in the reduce op (ikawrakow#1124)

a58a6a8

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Merge branch 'ikawrakow:main' into main

7869c56

Better VRAM utilization strategy for split mode graph (ikawrakow#1126)

d14c479

* Better VRAM utilization strategy for split mode graph
* Fix assert when --max-gpu is less than available GPUs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Merge branch 'ikawrakow:main' into main

d73fe98

ikawrakow closed this Jan 10, 2026

ikawrakow force-pushed the main branch from 3419c78 to 738dc60 on January 10, 2026 15:49

Thireus deleted the main branch January 11, 2026 07:02

Thireus wants to merge 4535 commits into ikawrakow:main from Thireus:main

13 participants
