Conversation
|
@AlexTzk I think you have to rebase; I'm seeing a lot of fixes for the 19->20 version here? |
|
@AlexTzk I'm doing some tests and it looks good so far. I'm impressed with the speed: 53-60 tok/s with a 32k context window.
I'm going to run some tests with opencode to check pp/s
|
|
++ @jjang-ai |
|
@wsantos Thank you kindly for taking the time to run my code and provide feedback! On Qwen3.5-35b MLX 4 bit vs JANG 4k the performance increase is noticeable:
But on Minimax 2.5 MLX 3bit vs JANG 2L, the story is not as consistent:
The Minimax benchmark is one of the best results the JANG architecture seems to bring -according to the website - which is why I wanted to validate the result for myself.
Benchmarks:
I will continue my research but may have to stop performing the full tests as this takes too much time. |
|
@wsantos I have rebased the branch and fixed all the conflicts. I also fixed a bug in jang.py that would fail to load Nemotron models. The implementation is complete and ready for merging. |
|
@jundot I saw on the other issue that you don't want this kind of implementation, but it makes big models usable on "low-end" machines. For example, I cannot run 35B-A3B0-4bit on my 32GB machine and work at the same time; with this PR that is possible. Maybe we could have a plugin system instead, so this could be implemented outside the core and be easy to integrate? Let me know how you want to proceed, and I can test the PR for you if you want. |
|
wen merge |
- New JANGLoader class in omlx/engine/jang.py
- Integrates with jang-tools package for loading JANG quantized models
- Handles Nemotron-H weight renaming (up_proj -> fc1/fc2)
- Auto-switches to bfloat16 for large expert models (512+ experts)
- VLM model support with load_jang_vlm_model
- Implements all BaseEngine abstract methods
- Create JANGLoader class in omlx/engine/jang.py for JANG model loading
- Integrate JANGLoader with EnginePool via engine_type == "jang" branch
- Support Nemotron-H weight renaming (up_proj->fc1, down_proj->fc2)
- Auto-switch to bfloat16 for large expert models (512+ experts)
- VLM support via load_jang_vlm_model
- jang-tools dependency check with clear error message
…y tests to include jang
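The Nemotron-H weight renaming mentioned in the commit messages above can be sketched roughly as follows. This is an illustrative sketch only: the key patterns `up_proj`/`down_proj` and `fc1`/`fc2` come from the commit text, but the function name and the exact key layout in omlx/engine/jang.py are assumptions.

```python
# Hypothetical sketch of the up_proj->fc1 / down_proj->fc2 renaming
# described in the commit message. Real key patterns in jang.py may differ.
def rename_nemotron_weights(weights: dict) -> dict:
    """Map Nemotron-H MLP weight names to the fc1/fc2 layout JANG expects."""
    renamed = {}
    for key, value in weights.items():
        new_key = key.replace("up_proj", "fc1").replace("down_proj", "fc2")
        renamed[new_key] = value
    return renamed
```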
|
The PR has been updated, yet again, and is ready for merging. |
|
@AlexTzk It's not working. I did another test and got: |
|
@wsantos which model are you loading? Please also make sure you are up to date with my branch; there were a couple of bugs I fixed in the last commits. |
Both 35B. I'll clean up everything, including the cache, and try again.
Introduce the jang-tools engine supporting mixed-precision quantization (attention 6-8-bit, experts 2-4-bit) with VLM capabilities. Includes model discovery updates, engine pool integration, and comprehensive tests.
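The mixed-precision split this commit describes (attention at 6-8 bits, experts at 2-4 bits) can be sketched as a simple per-layer bit-width predicate. This is illustrative only; the function name, path patterns, and defaults are assumptions, not omlx's real API.

```python
# Illustrative per-layer bit-width predicate in the spirit of the
# mixed-precision scheme above: attention layers stay at high precision,
# expert MLP layers get aggressively quantized. Names are assumptions.
def bits_for_layer(path: str, attn_bits: int = 8, expert_bits: int = 2,
                   default_bits: int = 4) -> int:
    if "attn" in path or "attention" in path:
        return attn_bits  # critical for coherence, keep precise
    if "expert" in path:
        return expert_bits  # bulk of MoE parameters, compress hard
    return default_bits  # embeddings and everything else
```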
|
@wsantos you were totally right, there was a bug with Qwen3.5 35B. I fixed it, rebased on the latest release, and fixed another bug with Mistral. @jundot - are you planning to integrate this work? It seems that a few people, including myself, see the benefit of this architecture. Whilst I do agree this quant implementation would be better at the mlx level, sometimes you have to let the community make their own choices. If there is anything I can address, kindly let me know, because playing cat and mouse here is not a luxury I can afford daily. |
|
@AlexTzk sorry for the late response on this. It's not that I was ignoring this specific PR or anything like that; I just haven't been able to review any PRs during weekdays at all. As I mentioned in this discussion, oMLX is a personal project I work on outside of my main job, and my availability is pretty limited. I hope you understand. I pulled your branch yesterday (before the last bug patch) and tried loading a JANG model to test things out. It failed to load with errors at that point. It looks like the loading issue itself is resolved now after your latest fixes. However, the benchmark numbers you posted seem to differ quite a bit from what I actually measured. I'm guessing those were tested against a much earlier version of oQ. Here are my actual results on Nemotron-Cascade-2-30B-A3B:
Could you re-run your benchmarks against the current versions and update the numbers? The results I'm seeing don't match what was posted earlier. Looking at these benchmark results honestly, I'm not sure if the additional dependency (

I want to be clear that this is not me saying "use oQ instead." As I mentioned in a previous issue, I don't think it's a good thing for the MLX ecosystem to have platform-specific quants that only work on a specific inference server. My position on this hasn't changed. Unsloth and GGUF iQ quants are both platform-agnostic quants that work without requiring any changes to the underlying platform. I believe MLX should go the same direction. Under that premise, if JANG quantization were supported upstream in mlx-lm itself, I would have zero issues with it. The concern is about adding a platform-locked dependency into oMLX specifically.

On the code side, I did spot a few things that might need some adjustments (type annotations, engine dispatch logic, some unused methods, etc). But let's settle the discussion above first. Once you share your thoughts on the platform dependency question, I'll go over the code details with you.

Code review notes (for reference)
Bugs
Dead code (~25% of jang.py)
Unrelated changes bundled in
Testing gaps
|
|
@jundot - thanks for coming back to us. The results I posted were from the builds/artifacts I had at the time, not intended as a final verdict against every latest oQ revision. I agree the benchmarks should be rerun on the current versions. Before I do that, could you share the exact download link for the Nemotron-Cascade-2-30B-A3B Cascade model you tested, and/or the exact FP MLX baseline you used? I can't find the full-precision MLX artifact on my side, and I want to make sure I'm reproducing your setup exactly. Once I have the same artifacts, I'm happy to rerun and post the raw JSON outputs. Alternatively, I can run your oQ4 if you upload it to HF. Regarding the code, I acknowledge it's not in perfect shape; I was at the point where I wasn't sure if it was still worth my time to rectify it. We'll cross that bridge when we get there. I do appreciate your efforts and dedication to this platform; it's my favourite Apple LLM server, which is why I am keen on contributing. |
|
@jundot I'd like to reproduce it if you provide either the models or instructions on how to reproduce it. I'm not sure what oQ4e means; is that the same as oQ+? |
|
The original (unquantized) model I used for testing is downloaded directly from this repo (58.8GB): Note that loading an original model through mlx-lm can use up to 2x memory, so keep that in mind. I also uploaded the re-quantized models using the latest version:
(you can get the same results by quantizing with Enhanced mode yourself using 0.2.24 or 0.2.23)

About Q4e: it's a renamed version of Q4+. It turns out HuggingFace doesn't allow special characters in repo names, so I couldn't upload it as Q4+. The "e" stands for "enhanced". My mistake for the confusion there.

I'm always happy to answer any quantization questions, whether it's about oQ or anything else. Everyone wants to run high-performance models on limited memory, so I appreciate you taking the time to look into this!

@AlexTzk - you calling oMLX your favourite Apple LLM server genuinely made my day as the creator. I also know this PR took a lot of effort to put together. Always grateful for your contributions.
You already did it. I'm downloading and testing them, ty~
Thread mlx-lm's XTC (eXclude Top Choices) sampling parameters through the full request pipeline. XTC was the only mlx-lm sampler missing from the omlx API surface.

- Add xtc_probability and xtc_threshold fields to SamplingParams dataclass (default 0.0 and 0.1 respectively)
- Default xtc_threshold to 0.1 instead of upstream's 0.0 to prevent destructive sampling when only probability is set (upstream threshold=0.0 excludes all tokens except the least probable one)
- Add optional xtc_probability and xtc_threshold to both ChatCompletionRequest and CompletionRequest API models
- Extend get_sampling_params() to resolve XTC values with the same request > default priority as other sampling params
- Thread XTC params through chat_kwargs dicts and direct engine calls across all API endpoints (chat, completion, anthropic messages, responses)
- Extract XTC params from kwargs in BatchedEngine and VLMBatchedEngine SamplingParams construction
- Pass xtc_probability, xtc_threshold, and xtc_special_tokens to both make_sampler() call sites in the scheduler
- Add _get_xtc_special_tokens() helper to Scheduler, delegating to _get_stop_tokens() for EOS coverage and caching the result at init time
- Add 10 new tests covering defaults, passthrough, API model acceptance, and special token derivation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>
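The threshold rationale in the commit above (why a 0.0 threshold is destructive) is easiest to see in a minimal sketch of XTC filtering. This is an illustrative re-implementation over plain probability lists, not mlx-lm's actual sampler code; the parameter names follow the commit text.

```python
import random

# Minimal sketch of XTC ("eXclude Top Choices") filtering. With probability
# xtc_probability, every token whose probability is at or above xtc_threshold
# is excluded -- except the least probable such token, which is kept so the
# distribution never becomes empty. Note why threshold=0.0 is destructive:
# every token qualifies as a "top choice", so only the single least probable
# token would survive.
def xtc_filter(probs, xtc_probability=0.0, xtc_threshold=0.1, rng=random.random):
    if xtc_probability <= 0.0 or rng() >= xtc_probability:
        return probs  # sampler not triggered this step
    above = [i for i, p in enumerate(probs) if p >= xtc_threshold]
    if len(above) < 2:
        return probs  # nothing to exclude
    keep = min(above, key=lambda i: probs[i])  # least probable "top choice"
    return [p if (i not in above or i == keep) else 0.0
            for i, p in enumerate(probs)]
```

(The zeroed probabilities would be renormalized before sampling; that step is omitted here for brevity.)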
- _extract_text_from_content_list(): Enhanced to handle edge cases
- extract_text_content(): Add ContentPart list handling for tool messages and final safety check to ensure all content is string type
- extract_multimodal_content(): Add ContentPart list handling for tool messages
- extract_harmony_messages(): Add ContentPart list handling for tool and assistant messages

Fixes ValueError when messages with content arrays are sent to MLX models.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
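The ContentPart handling this commit describes can be sketched as below. This is a hedged sketch, not the repo's actual implementation: the `{"type": "text", "text": ...}` part shape is assumed from the OpenAI-style message schema the API models mirror.

```python
# Sketch of flattening message content that may be a plain string or a list
# of typed content parts. Part field names ("type", "text") are assumptions.
def extract_text_content(content):
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for part in content:
            if isinstance(part, dict) and part.get("type") == "text":
                parts.append(part.get("text", ""))
            elif isinstance(part, str):
                parts.append(part)
        return "".join(parts)
    return str(content)  # final safety check: always return a string
```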
…duplication, add tests
Why: Qwen3-VL embedding models can be loaded through mlx-embeddings, but oMLX always used the generic processor(texts, ...) path for embedding requests. For custom processors such as qwen3_vl, that positional call is interpreted as image input, which breaks /v1/embeddings even when the model is explicitly treated as an embedding model.

What: Detect processors that expose custom embedding input hooks and route embedding requests through prepare_embedding_inputs/prepare_model_inputs instead of the generic tokenizer path. Keep the existing path for standard text processors, and add regression coverage for both compiled and eager execution.
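The routing described above can be sketched with a simple capability check. The hook name `prepare_embedding_inputs` comes from the commit text; the dispatch function itself and the generic-call keywords are assumptions for illustration.

```python
# Sketch: prefer a processor's custom embedding-input hook when present,
# otherwise fall back to the generic tokenizer-style call. The generic call's
# keyword arguments here are illustrative, not oMLX's exact invocation.
def prepare_embedding_batch(processor, texts):
    if hasattr(processor, "prepare_embedding_inputs"):
        # custom processors (e.g. qwen3_vl) need embedding-aware preparation;
        # the generic positional call would be misread as image input
        return processor.prepare_embedding_inputs(texts)
    return processor(texts, padding=True)
```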
Why: - add structured multimodal embedding inputs without breaking the existing text input path - support custom embedding processors that need image-aware input preparation - keep native embedding loading safe by accepting extra HF weights while rejecting missing or shape-incompatible core weights What: - add an items-based embedding request format for text and image inputs - route structured items through embedding normalization, engine, and custom processor preparation - count usage from prepared multimodal inputs and preserve empty-string text items - extend tests for multimodal requests, custom processors, and native loading validation
…piled fallback
- remove unused is_likely_local_image_path() and its import os
- reset total_tokens to None when compiled embed path fails, so the eager fallback recomputes instead of using a stale value
- narrow bare except in _count_prepared_tokens() to (TypeError, ValueError)
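The stale-value reset in the commit above can be sketched like this. The function and variable names are hypothetical; only the pattern (clear the token count so the eager path recomputes) reflects the commit.

```python
# Illustrative sketch: if the compiled embedding path fails partway, reset
# total_tokens to None so the eager fallback recomputes it rather than
# reusing a stale value from the failed attempt. Names are hypothetical.
def embed_with_fallback(compiled_fn, eager_fn, texts):
    embeddings, total_tokens = None, None
    try:
        embeddings, total_tokens = compiled_fn(texts)
    except (TypeError, ValueError):
        total_tokens = None  # force the eager path to recompute
        embeddings = None
    if embeddings is None:
        embeddings, total_tokens = eager_fn(texts)
    return embeddings, total_tokens
```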
eliminates mxfp8 mode and gs=32 special case for 8-bit tensors. all quantized layers now use affine mode with gs=64, reducing Metal kernel combo count from 7 to ~5 for oQ4 MoE models.
* fix: update menubar content in real time while menu is open

Two root causes were fixed:
1. NSTimer was registered under NSDefaultRunLoopMode only, which is suspended while the status-bar menu is open (macOS enters NSEventTrackingRunLoopMode). Changed to NSRunLoopCommonModes so healthCheck_ fires even during menu interaction.
2. _build_menu() replaces the entire NSMenu object via setMenu_(), which would close a currently-visible menu. Instead, when the menu is open we now call _refresh_menu_in_place(), which mutates the existing NSMenuItem objects in place (status header attributed title, server control button visibility via setHidden_, Admin Panel / Chat enabled state).

Additional changes:
- Adopt NSMenuDelegate; menuWillOpen_ always refreshes items before the menu is shown, so reopening also reflects the latest state instantly.
- menuDidClose_ clears the _menu_is_open flag so _build_menu() is used for full rebuilds (stats submenu, etc.) when the menu is closed.
- Server control section now always adds all three items (Stop / Force Restart / Start) and uses setHidden_ to show the relevant one, enabling in-place toggling without menu replacement.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review feedback on menubar real-time update PR

Four issues raised in jundot#426:
1. Skip _fetch_stats() when menu is open to avoid blocking the main thread. _fetch_stats() makes up to 3 synchronous HTTP requests with 2s timeouts each, which could stall the UI for ~6s during menu event tracking. Stats are fetched on the next healthCheck_ cycle after the menu closes.
2. Add ServerStatus.STOPPING to the Stop Server button visibility condition in both _build_menu() and _refresh_menu_in_place(). The button was hidden during the STOPPING transition, leaving no control visible to the user.
3. Restore original button order: Force Restart appears before Stop Server (Force Restart is the primary action when UNRESPONSIVE). The previous commit had them in the wrong order.
4. Sync icon template state in _refresh_menu_in_place() by calling setTemplate_(True) on the Admin Panel and Chat icons after updating their enabled state, keeping icon rendering consistent with _build_menu().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: EmotionalAmo <emotionalamo@EmotionalAmos-MacBook-Pro.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
- add oq_manager/hf_uploader fields to ServerState dataclass
- update KVCache reconstruct test to expect tensor shape offset
- update oQ predicate bits test for affine-only mode
- rewrite metal limit tests to match no-op behavior (jundot#429)
- fix memory fallback test mock to patch HAS_MLX
mlx-vlm v0.4.2 custom processors may have apply_chat_template method but no chat_template set, raising ValueError instead of TypeError. fall back to processor.tokenizer which holds the actual template.
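The fallback this commit describes can be sketched as a small wrapper. The `processor.tokenizer` attribute and the ValueError behaviour come from the commit text; the wrapper function itself is an assumed shape, not the repo's actual code.

```python
# Sketch of the chat-template fallback: some mlx-vlm custom processors expose
# apply_chat_template but raise ValueError (not TypeError) when no
# chat_template is set; fall back to the underlying tokenizer, which holds
# the actual template.
def apply_chat_template_safe(processor, messages, **kwargs):
    try:
        return processor.apply_chat_template(messages, **kwargs)
    except (TypeError, ValueError):
        return processor.tokenizer.apply_chat_template(messages, **kwargs)
```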
|
@wsantos - I benchmarked the nemotron cascade 2 oQe with:
Very impressive result with Qwen3.5! I have fixed the bugs and brought the implementation up to date. |
TY, I'll play with it this week |
Summary
This PR adds support for JANG quantized models to oMLX. JANG is an Apple Silicon MLX-based architecture that improves model quality when using low-bit quantization (e.g., 2-bit) by applying mixed-precision quantization — preserving critical attention layers at higher precision (6–8 bit) while quantizing expert MLP layers to 2-bit.
Background
On MoE models, attention is only 1–5% of parameters but controls 100% of coherence. Standard MLX quantization compresses everything equally, which breaks low-bit models. JANG solves this by:
Why
JANG quantization is quite useful for running modern large language models on Apple Silicon hardware. Without it, MoE models with 256+ experts either crash or produce NaNs below 4-bit quantization due to attention layers being compressed too aggressively. This enables models like Qwen3.5-397B (512 experts) and Nemotron-3-Super-120B to run efficiently on M-series Macs at ~2-3 bits per parameter, making high-capacity reasoning models accessible without enterprise-grade GPU infrastructure. The integration lets oMLX users leverage these state-of-the-art quantized models directly.
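A back-of-envelope memory estimate shows why the low average bit width matters. The numbers below are illustrative; the real per-layer bit split (and hence the average) varies by model.

```python
# Rough weight-memory estimate for a quantized model: parameter count times
# average bits per weight, converted to gigabytes. Illustrative only.
def quantized_size_gb(n_params: float, avg_bits: float) -> float:
    return n_params * avg_bits / 8 / 1e9

# e.g. a 397B-parameter MoE at an average of ~2.5 bits per weight comes to
# roughly 124 GB of weights, versus roughly 794 GB at bfloat16 (16 bits).
```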
Changes
- omlx/engine/jang.py: JANGLoader class, the full JANG model loader with Nemotron-H support, bfloat16 for large models, and VLM support
- omlx/model_discovery.py: jang_config.json detection
- omlx/exceptions.py: JANGLoadError and JANGDependencyError exceptions
- omlx/engine_pool.py: engine_type == "jang" integration
- omlx/server.py: oq_manager added to ServerState
- pyproject.toml: jang[mlx]>=0.1.0 dependency
- packaging/venvstacks.toml: jang-tools dependency
- omlx/admin/routes.py
- Tests (test_jang_vlm.py, test_model_discovery.py)

Files Changed
- .gitignore
- omlx/admin/routes.py
- omlx/engine/jang.py
- omlx/engine_pool.py
- omlx/model_discovery.py
- omlx/server.py
- packaging/venvstacks.toml
- pyproject.toml
- tests/integration/test_e2e_streaming.py
- tests/test_admin_auth.py
- tests/test_jang_vlm.py
- tests/test_model_discovery.py

Total: 12 files changed, 1,130 insertions(+), 10 deletions(−)
Testing
- tests/test_jang_vlm.py: JANG VLM model loading tests
- tests/test_model_discovery.py: JANG model detection tests