Add Windows Build Workflow (GitHub Actions) #976
Closed
Thireus wants to merge 4535 commits into ikawrakow:main from
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse copies to K- and V-cache on CUDA * Adapt to latest main --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use mmq_id in mul_mat_id * Better * Also use it in the fused up+gate op * Better -no-fmoe TG on CUDA Still much slower than -fmoe, but about 20-25% faster than what we had before. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
…kow#924) * server: fix crash when prompt has image and is too long * server: fix CORS * server: fix empty result for embedding * change error message to truncate prompt * server: fix slot id for save and load state * bug fix * server: update slot similarity to handle mtmd * server: quick hack to calculate number of token processed with image * server: fix out of range error when detokenizing prompt under verbose * Add back Access-Control-Allow-Origin * Server: Add prompt tokens in embedding results --------- Co-authored-by: firecoperana <firecoperana>
…kawrakow#926) Co-authored-by: firecoperana <firecoperana>
This commit enables IQK quantization operations on ARM-based systems,
specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell.
Changes:
- Enable IQK_IMPLEMENT macro for ARM NEON operations
- Add arm_neon.h header include for ARM SIMD intrinsics
- Fix compilation errors related to missing NEON types and functions
Build requirements for ARM:
cmake .. -DGGML_CUDA=ON \
-DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \
-DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16"
Tested on:
- Platform: NVIDIA DGX Spark (aarch64)
- CPU: GB10 Grace Blackwell Superchip
- Memory: 128GB unified memory
Fixes build errors:
- 'float32x4_t' does not name a type
- 'vld1q_f32' was not declared in this scope
- 'v_expf' was not declared in this scope
- Missing FP16 NEON intrinsics
* Make biased gemv fusion optional * Fix one path through gemv fusion * Remove forgotten printf --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
…ine args (ikawrakow#932) * Add command line argument for draft model * Remove second context of draft model * Format print * print usage if parsing -draft fails --------- Co-authored-by: firecoperana <firecoperana>
* Fuse concat and copy into K cache * Avoid ggml_cont() when n_token = 1 Combined effect: about +2% in TG performance with full GPU offload Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use new-new-mma also for MLA=3, and use mask bounds This gives us ~25% better PP at 32k tokens compared to main * This seems better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Convert from HF * Model loading and compute graph --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add --n-cpu-moe to llama-bench * Add usage --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
so more recent users that haven't followed the history of FlashMLA evolution and hence don't know about the MLA options get the best setting without having to add -mla 3 on the command line. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Mimo-2 support * Fix bug for head sizes not being the same It still does not solve the Mimo-2 quantized cache issue. * Fix quantized cache * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP * Cleanup * Set max_gpu to 2 for Mimo2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This reverts commit caffc04.
…k phase..." This reverts commit 4b22e1e.
…nk phase..." This reverts commit 41ba6cd.
…k phase..." This reverts commit d6bcc43.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix ring reduction * Actually enable it --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: firecoperana <firecoperana>
* Split mode "graph" for GPT-OSS * Force split_mode_f16 to false --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
…gml-cuda.dll)
Attempt to address Thireus/GGUF-Tool-Suite#41
> I primarily want ik_llama backend builds split back into 3 pieces ( ggml-base.dll, ggml-cpu.dll, and ggml-cuda.dll ) so they can be more easily monkey-patched into LM Studio.
Which change do I need to make in my code for this to happen. Tell me which change I need to make in which file and where in that file. I'll copy paste what you give me.
Fix numerous LNK2019 and LNK2001 "unresolved external symbol" errors
Fix unresolved external symbols
Revert "Fix unresolved external symbols"
This reverts commit 20fae5b.
Revert "Fix numerous LNK2019 and LNK2001 "unresolved external symbol" errors"
This reverts commit 8857398.
Revert "ggml.dll split back into 3 pieces (ggml-base.dll, ggml-cpu.dll, and ggml-cuda.dll)"
This reverts commit a546f9e.
Another ggml.dll split attempt
Bugfix attempt
include the high-level ggml sources mainline keeps in ggml-base
Test something else
Another bugfix attempt
Revert "Another bugfix attempt"
This reverts commit 81d30bc.
Revert "Test something else"
This reverts commit fccfe19.
Revert "include the high-level ggml sources mainline keeps in ggml-base"
This reverts commit 3c38b71.
Fix missing ggml/src/ggml-impl.h
Fix ggml_table_f32_f16 issue
Update ggml-backend-reg.cpp
Update ggml-backend-reg.cpp
Update ggml-backend-reg.cpp
Fix LNK1248: image size exceeds maximum allowable size
Update CMakeLists.txt
Update CMakeLists.txt
Update CMakeLists.txt
Update CMakeLists.txt
…l, and ggml-cuda.dll)" This reverts commit 8d4c2d4.
* Ernie-4.5-MoE split mode graph * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Better VRAM utilization strategy for split mode graph * Fix assert when --max-gpu is less than available GPUs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
First, thank you for maintaining this project — it has been very useful, and I appreciate the work that has gone into it.
I initially created a fork to add automated Windows builds for my own use, since I needed ready-to-use binaries. Since this is functionality that could benefit other users as well, I’m submitting this pull request so the Windows build workflow can live directly in the main repository instead of in my fork.
This PR includes:
Builds must be manually triggered (I believe they could be automated after each commit, but I have not had the chance to dig into it). There is also a bit of cleanup to do, specifically regarding other automated jobs that run to conduct meaningless checks. The original code was obtained from mainline with some tweaking to adapt it to ik_llama.cpp.
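For reference, the difference between a manually triggered build and one that runs after each commit comes down to the `on:` block of the workflow file. The fragment below is only a sketch of that idea, not the workflow from this PR: the file name and the branch name `main` are assumptions, and the jobs are omitted.

```yaml
# Hypothetical trigger block for a workflow file such as
# .github/workflows/build-windows.yml (the file name is an assumption).
on:
  # Manual trigger: adds a "Run workflow" button in the Actions tab.
  workflow_dispatch:
  # Optional automatic trigger: run on every push to the listed branch.
  # The branch name is an assumption; remove this block to keep builds manual-only.
  push:
    branches:
      - main
```

Keeping `workflow_dispatch` alongside `push` would still allow a build to be started by hand, for example to rebuild an older commit.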
My goal was to make the project more accessible to Windows users, specifically users who do not find the time to set up a dev env or don't have the knowledge to do it on Windows, such as myself. If you’d prefer changes to structure, naming, workflow triggers, or anything else, I’m happy to adjust the PR accordingly.
Thanks again for the project and for taking the time to review this!