
feat: JANG implementation #364

Open
AlexTzk wants to merge 83 commits into jundot:main from AlexTzk:main

Conversation


AlexTzk commented Mar 24, 2026


Summary

This PR adds support for JANG quantized models to oMLX. JANG is an Apple Silicon MLX-based architecture that improves model quality when using low-bit quantization (e.g., 2-bit) by applying mixed-precision quantization — preserving critical attention layers at higher precision (6–8 bit) while quantizing expert MLP layers to 2-bit.

Background

On MoE models, attention is only 1–5% of parameters but controls 100% of coherence. Standard MLX quantization compresses everything equally, which breaks low-bit models. JANG solves this by:

  • Attention: 6–8 bit (preserves coherence)
  • Expert MLP: 2–4 bit (95%+ of params, can absorb errors)
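
As a rough illustration of that split (the helper name and path patterns below are hypothetical, not code from this PR), the per-layer bit-width rule boils down to a predicate over parameter paths:

```python
def choose_bits(param_path: str) -> int:
    """Pick a quantization bit-width for a parameter based on its role.

    Hypothetical sketch of the JANG mixed-precision idea: attention
    weights keep high precision, expert MLP weights drop to 2-bit.
    """
    attention_markers = ("self_attn", "q_proj", "k_proj", "v_proj", "o_proj")
    expert_markers = ("switch_mlp", "experts", "mlp")

    if any(m in param_path for m in attention_markers):
        return 8   # preserve coherence-critical attention layers
    if any(m in param_path for m in expert_markers):
        return 2   # expert MLPs make up most params and absorb errors
    return 6       # embeddings, norms, everything else: middle ground


# attention stays at 8-bit, experts go to 2-bit
print(choose_bits("model.layers.0.self_attn.q_proj.weight"))        # 8
print(choose_bits("model.layers.0.mlp.switch_mlp.up_proj.weight"))  # 2
```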

Why

JANG quantization is particularly useful for running modern large language models on Apple Silicon hardware. Without it, MoE models with 256+ experts either crash or produce NaNs below 4-bit quantization because their attention layers get compressed too aggressively. JANG enables models like Qwen3.5-397B (512 experts) and Nemotron-3-Super-120B to run efficiently on M-series Macs at ~2-3 bits per parameter, making high-capacity reasoning models accessible without enterprise-grade GPU infrastructure.
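
As a back-of-envelope sketch (all fractions and overhead figures here are illustrative assumptions, not measurements from this PR), the effective storage cost of a mixed-precision split is just a weighted average of the per-layer bit-widths:

```python
def effective_bits(attn_frac, attn_bits, expert_bits, overhead_bits=0.25):
    """Weighted average bit-width for a mixed-precision model.

    overhead_bits approximates the per-weight cost of group scales
    (e.g. one 16-bit scale per group of 64 weights); 0.25 is an
    assumed round number, purely for illustration.
    """
    return attn_frac * attn_bits + (1 - attn_frac) * expert_bits + overhead_bits


# attention ~5% of params at 8-bit, experts ~95% at 2-bit
bits = effective_bits(0.05, 8, 2)
print(round(bits, 2))  # 2.55

# rough weight memory for a 397B-parameter model, in GB
print(round(397e9 * bits / 8 / 1e9))  # 127
```

At ~2.5 effective bits, a 512-expert 397B model lands near 127 GB of weights, which is why such models become feasible on high-memory M-series Macs at all.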

Changes

File Changes
omlx/engine/jang.py Created JANGLoader class — full JANG model loader with Nemotron-H support, bfloat16 for large models, VLM support
omlx/model_discovery.py Added JANG model detection via jang_config.json
omlx/exceptions.py Added JANGLoadError and JANGDependencyError exceptions
omlx/engine_pool.py Added engine_type == "jang" integration
omlx/server.py Added oq_manager to ServerState
pyproject.toml Added jang[mlx]>=0.1.0 dependency
packaging/venvstacks.toml Added jang-tools dependency
omlx/admin/routes.py Minor import fix
Tests Added comprehensive JANG tests (test_jang_vlm.py, test_model_discovery.py)
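
The jang_config.json detection added to model_discovery.py presumably boils down to a directory check along these lines (function name and return values are illustrative, not the PR's actual code):

```python
from pathlib import Path
import tempfile

def detect_engine_type(model_dir: str) -> str:
    """Return the engine type for a model directory.

    Sketch of the discovery rule this PR describes: a jang_config.json
    marks a JANG model; otherwise fall back to the standard MLX engine.
    """
    root = Path(model_dir)
    if (root / "jang_config.json").is_file():
        return "jang"
    if (root / "config.json").is_file():
        return "mlx"
    raise ValueError(f"not a model folder: {model_dir}")

# demo with a throwaway directory
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "jang_config.json").write_text("{}")
    print(detect_engine_type(d))  # jang
```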

Files Changed

File Lines
.gitignore +1 (JANG config)
omlx/admin/routes.py 5
omlx/engine/jang.py 675 (new file)
omlx/engine_pool.py 10
omlx/model_discovery.py 63
omlx/server.py 1
packaging/venvstacks.toml 2
pyproject.toml 2
tests/integration/test_e2e_streaming.py 5
tests/test_admin_auth.py 7
tests/test_jang_vlm.py 248 (new test file)
tests/test_model_discovery.py 121

Total: 12 files changed, 1,130 insertions(+), 10 deletions(−)

Testing

  • tests/test_jang_vlm.py — JANG VLM model loading tests
  • tests/test_model_discovery.py — JANG model detection tests
  • All streaming tests pass

AlexTzk mentioned this pull request Mar 24, 2026

wsantos commented Mar 24, 2026

@AlexTzk I think you have to rebase, I'm seeing a lot of fixes for 19->20 version here?


wsantos commented Mar 24, 2026

@AlexTzk I'm doing some tests and it looks good so far. I'm impressed with the speed: 53-60 tok/s with a 32k context window

image

But I noticed that the cache says 0.0%; not sure if it's not collecting metrics or not using the cache at all
benchmark:

image

I'm going to run some tests with opencode to check pp/s

  • We might need a new parser? I'm seeing some <dcp-message-id>m0003</dcp-message-id> on the thinking stage
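
If a parser tweak is needed, stripping those markers could be as simple as a regex pass over the streamed thinking text (the tag name is taken from the observed output above; everything else is a guess at what such a filter would look like):

```python
import re

# matches the <dcp-message-id>...</dcp-message-id> spans seen in the thinking stage
_DCP_TAG = re.compile(r"<dcp-message-id>.*?</dcp-message-id>\s*")

def strip_dcp_tags(text: str) -> str:
    """Remove internal message-id markers from model output."""
    return _DCP_TAG.sub("", text)

print(strip_dcp_tags("<dcp-message-id>m0003</dcp-message-id>Let me think..."))
# Let me think...
```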

@Fail-Safe

++ @jjang-ai


AlexTzk commented Mar 24, 2026

@wsantos Thank you kindly for taking the time to run my code and provide feedback!
I will test the cache on my end as well with the JANG models. At the moment I have queued a bunch of benchmarks, I am quantifying the benefit of the JANG architecture vs MLX vs oQ (oMLX).

On Qwen3.5-35b MLX 4 bit vs JANG 4k the performance increase is noticeable:

image

But on Minimax 2.5 MLX 3bit vs JANG 2L, the story is not as consistent:

image

The Minimax benchmark is one of the best results the JANG architecture claims to bring (according to the website), which is why I wanted to validate the result for myself.

image

Benchmarks:

  • MMLU — 1000 questions
  • Truthful QA — 817 questions
  • HumanEval — 164 questions
  • LiveCodeBench — 100 tasks
  • MBPP — 200 tasks

I will continue my research but may have to stop performing the full tests as this takes too much time.

AlexTzk changed the title from "JANG implementation" to "feat: JANG implementation" Mar 24, 2026

AlexTzk commented Mar 25, 2026

@wsantos I have rebased the branch and fixed all the conflicts. I also fixed a bug in jang.py that would fail to load Nemotron models. The implementation is complete and ready for merging.


wsantos commented Mar 25, 2026

@jundot I saw on the other issue that you don't want this kind of implementation, but this makes big models usable on "low-end" machines. For example, I cannot run 35B-A3B0-4bit on my 32GB machine and work at the same time; with this PR it's possible. Maybe we could have a plugin system instead, so this could be implemented outside the core and be easy to integrate? Let me know how you want to proceed and I can keep testing the PR for you if you want.

@0xClandestine

wen merge


AlexTzk commented Mar 26, 2026

More results from the JANG models. I must say, very impressive results from the Super model but even the Cascade model still outperforms the oQ variant.

These benchmarks are done on:

  • MMLU — 1000 questions
  • Truthful QA — 817 questions
  • HumanEval — 164 questions
  • LiveCodeBench — 100 tasks
  • MBPP — 200 tasks

I wanted to test and make sure the performance increase is there and it is. The community would benefit from this architecture. It outperforms the 8-bit mlx NANO variant too. The nemotron variants are also incredibly fast.

image

If anyone wants the actual JSON benchmark outputs I can post them somewhere.

AlexTzk added 5 commits March 26, 2026 13:24
- New JANGLoader class in omlx/engine/jang.py
- Integrates with jang-tools package for loading JANG quantized models
- Handles Nemotron-H weight renaming (up_proj -> fc1/fc2)
- Auto-switches to bfloat16 for large expert models (512+ experts)
- VLM model support with load_jang_vlm_model
- Implements all BaseEngine abstract methods
- Create JANGLoader class in omlx/engine/jang.py for JANG model loading
- Integrate JANGLoader with EnginePool via engine_type == "jang" branch
- Support Nemotron-H weight renaming (up_proj->fc1, down_proj->fc2)
- Auto-switch to bfloat16 for large expert models (512+ experts)
- VLM support via load_jang_vlm_model
- jang-tools dependency check with clear error message
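
The Nemotron-H renaming the commits describe can be sketched as a key rewrite over the loaded weight dict (the up_proj -> fc1 / down_proj -> fc2 mapping comes from the commit messages; the helper itself is an assumed shape, not the PR's code):

```python
# up_proj -> fc1, down_proj -> fc2, per the commit messages above
_RENAMES = {"up_proj": "fc1", "down_proj": "fc2"}

def rename_nemotron_h_weights(weights: dict) -> dict:
    """Rewrite weight keys so Nemotron-H checkpoints match the MLX module tree."""
    out = {}
    for key, value in weights.items():
        parts = key.split(".")
        out[".".join(_RENAMES.get(p, p) for p in parts)] = value
    return out

w = {"model.layers.0.mlp.up_proj.weight": 1,
     "model.layers.0.mlp.down_proj.weight": 2}
print(sorted(rename_nemotron_h_weights(w)))
# ['model.layers.0.mlp.fc1.weight', 'model.layers.0.mlp.fc2.weight']
```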

AlexTzk commented Mar 26, 2026

The PR has been updated - yet again - and is ready for merging.


wsantos commented Mar 27, 2026

@AlexTzk It's not working; I did another test and got:

odel.language_model.layers.9.mlp.switch_mlp.up_proj.weight.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 410, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/applications.py", line 1160, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/middleware/cors.py", line 95, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/middleware/cors.py", line 153, in simple_response
    await self.app(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/server.py", line 517, in __call__
    await self.app(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 130, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 116, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 670, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 324, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/server.py", line 1799, in create_chat_completion
    engine = await get_engine_for_model(request.model)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/server.py", line 673, in get_engine_for_model
    return await get_engine(model, EngineType.LLM)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/server.py", line 598, in get_engine
    engine = await pool.get_engine(model_id)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/engine_pool.py", line 398, in get_engine
    await self._load_engine(model_id, force_lm=force_lm)
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/engine_pool.py", line 575, in _load_engine
    await engine.start()
  File "/opt/homebrew/Cellar/python@3.11/3.11.15/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/engine/batched.py", line 146, in _load_model_sync
    return load(
           ^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/mlx_lm/utils.py", line 491, in load
    model, config = load_model(model_path, lazy, model_config=model_config)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/mlx_lm/utils.py", line 415, in load_model
    model.load_weights(list(weights.items()), strict=strict)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/mlx/nn/layers/base.py", line 191, in load_weights
    raise ValueError(f"Missing {num_missing} parameters: \n{missing}.")
ValueError: Missing 660 parameters:
language_model.lm_head.weight,
language_model.model.embed_tokens.weight,
language_model.model.layers.0.input_layernorm.weight,


AlexTzk commented Mar 27, 2026

@wsantos which model are you loading? Please also make sure you are up to date with my branch; there were a couple of bugs I fixed in the last commits.


wsantos commented Mar 27, 2026

@wsantos which model are you loading? Please also make sure you are up to date with my branch, there were a couple of bugs i fixed in the last commits.

Hot cache: 1.0GB (in-memory)
2026-03-26 21:25:53,427 - omlx.server - INFO - CORS origins: ['*']
2026-03-26 21:25:53,427 - omlx.model_settings - INFO - Loaded settings for 18 models
2026-03-26 21:25:53,428 - omlx.model_discovery - INFO - Discovered model: Nemotron-Cascade-2-30B-A3B-JANG_2L (type: llm, engine: jang, size: 10.79GB)
2026-03-26 21:25:53,428 - omlx.model_discovery - INFO - Discovered model: Qwen3.5-27B-JANG_4S (type: vlm, engine: jang, size: 16.68GB)
2026-03-26 21:25:53,428 - omlx.model_discovery - INFO - Discovered model: Qwen3.5-35B-A3B-JANG_2S (type: vlm, engine: jang, size: 5.24GB)
2026-03-26 21:25:53,429 - omlx.model_discovery - INFO - Discovered model: Qwen3.5-35B-A3B-JANG_4K (type: vlm, engine: jang, size: 10.44GB)
2026-03-26 21:25:53,429 - omlx.model_discovery - DEBUG - Skipping models--mlx-community--Qwen3.5-27B-4bit: no config.json found (not a model or organization folder)
2026-03-26 21:25:53,429 - omlx.model_discovery - DEBUG - Skipping models--mlx-community--Qwen3.5-35B-A3B-4bit: no config.json found (not a model or organization folder)
2026-03-26 21:25:53,429 - omlx.engine_pool - WARNING - Pinned model not found: Qwen3.5-35B-A3B-3bit
2026-03-26 21:25:53,429 - omlx.engine_pool - INFO - Discovered 4 models, max memory: 23.04GB
2026-03-26 21:25:53,429 - omlx.server - WARNING - Default model 'Qwen3.5-35B-A3B-4bit' not found, using first model
2026-03-26 21:25:53,429 - omlx.server_metrics - INFO - Loaded all-time stats from /Users/waldecirsantos/.omlx/stats.json
2026-03-26 21:25:53,430 - omlx.server - INFO - Server initialized with 4 models
2026-03-26 21:25:53,430 - omlx.server - INFO - Default model: Nemotron-Cascade-2-30B-A3B-JANG_2L
2026-03-26 21:25:53,430 - omlx.server - INFO - Max model memory: 23.04GB
2026-03-26 21:25:53,430 - omlx.server - INFO - Default max tokens: 32768
2026-03-26 21:25:53,430 - omlx.server - INFO - API key authentication: enabled
2026-03-26 21:25:53,430 - omlx.server - INFO - HF Downloader initialized
2026-03-26 21:25:53,481 - omlx.server - INFO - ModelScope SDK not installed, MS downloader disabled
2026-03-26 21:25:53,482 - omlx.server - INFO - oQ Quantizer initialized
2026-03-26 21:25:53,482 - omlx.server - INFO - HF Uploader initialized
Starting server at http://127.0.0.1:8000
2026-03-26 21:25:53,483 - asyncio - DEBUG - Using selector: KqueueSelector
INFO:     Started server process [51773]
INFO:     Waiting for application startup.
2026-03-26 21:25:53,486 - omlx.process_memory_enforcer - INFO - Metal memory limit set: 28.0GB, cache limit: 14.0GB
2026-03-26 21:25:53,486 - omlx.process_memory_enforcer - INFO - Process memory enforcer started (limit: 25.6GB, interval: 1.0s)
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
2026-03-26 21:26:07,604 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): api.github.com:443
2026-03-26 21:26:07,838 - urllib3.connectionpool - DEBUG - https://api.github.com:443 "GET /repos/jundot/omlx/releases/latest HTTP/1.1" 200 1554
2026-03-26 21:26:19,827 - omlx.model_settings - INFO - Updated settings for model 'Qwen3.5-35B-A3B-JANG_4K'
2026-03-26 21:26:19,829 - omlx.model_settings - DEBUG - Saved settings for 19 models
2026-03-26 21:26:23,091 - omlx.model_settings - INFO - Updated settings for model 'Qwen3.5-35B-A3B-JANG_4K'
2026-03-26 21:26:23,093 - omlx.model_settings - DEBUG - Saved settings for 19 models
2026-03-26 21:26:37,360 - omlx.server - DEBUG - Chat completion request received: model=Qwen3.5-35B-A3B-JANG_4K, messages=1, stream=True, max_tokens=None, temp=None
2026-03-26 21:26:37,361 - omlx.engine_pool - INFO - Loading model: Qwen3.5-35B-A3B-JANG_4K
2026-03-26 21:26:37,680 - torchao - DEBUG - Skipping import of cpp extensions: operator torchao::_linear_8bit_act_1bit_weight does not exist
W0326 21:26:37.744000 51773 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
2026-03-26 21:26:37,773 - torchao.kernel.intmm - WARNING - Warning: Detected no triton, on systems without Triton certain kernels will not work
2026-03-26 21:26:38,011 - omlx.engine.vlm - DEBUG - Removed video_processor from MODALITY_TO_AUTOPROCESSOR_MAPPING
2026-03-26 21:26:38,054 - omlx.engine_pool - WARNING - VLM loading failed for Qwen3.5-35B-A3B-JANG_4K, falling back to LLM: Received 219 parameters not in model:
model.language_model.layers.0.mlp.switch_mlp.gate_proj.biases,
model.language_model.layers.0.mlp.switch_mlp.gate_proj.scales,
model.language_model.layers.0.mlp.switch_mlp.gate_proj.weight,

both 35B, I'll clean up everything including cache and try again.

Introduce the jang-tools engine supporting mixed-precision quantization
(attention 6-8-bit, experts 2-4-bit) with VLM capabilities. Includes
model discovery updates, engine pool integration, and comprehensive tests.

AlexTzk commented Mar 27, 2026

@wsantos you were totally right, there was a bug with Qwen3.5 35B. I fixed it now, rebased on the latest release and fixed another bug with mistral.

@jundot - are you planning to integrate this work? It seems that a few people, including myself, see the benefit of this architecture. Whilst I do agree this quant implementation would be better at the mlx level, sometimes you have to let the community make their own choices. If there is anything I can address, kindly let me know, because playing cat & mouse here is not a luxury I can afford daily.


jundot commented Mar 28, 2026

@AlexTzk sorry for the late response on this. It's not that i was ignoring this specific PR or anything like that. I just haven't been able to review any PRs during weekdays at all. As i mentioned in this discussion, oMLX is a personal project i work on outside of my main job, and my availability is pretty limited. I hope you understand.


I pulled your branch yesterday (before the last bug patch) and tried loading a JANG model to test things out. It failed to load with errors at that point. Looks like the loading issue itself is resolved now after your latest fixes.

However, the benchmark numbers you posted seem to differ quite a bit from what i actually measured. I'm guessing those were tested against a much earlier version of oQ. Here are my actual results on Nemotron-Cascade-2-30B-A3B:

Model Size MMLU WINOGRANDE HUMANEVAL MBPP
Original (unquantized) 58.8 GB 68.1% (681/1000) 60.0% (760/1267) 81.1% (133/164) 68.3% (205/300)
JANG_4M 17.0 GB 67.7% (677/1000) 58.9% (746/1267) 79.9% (131/164) 65.0% (195/300)
oQ4e 17.3 GB 68.0% (680/1000) 59.4% (752/1267) 81.7% (134/164) 68.0% (204/300)
JANG_2L 10.3 GB 61.3% (613/1000) 51.9% (658/1267) 75.0% (123/164) 60.3% (181/300)
oQ2e 10.4 GB 59.8% (598/1000) 52.8% (669/1267) 75.0% (123/164) 59.3% (178/300)

Could you re-run your benchmarks against the current versions and update the numbers? The results i'm seeing don't match what was posted earlier.


Looking at these benchmark results honestly, i'm not sure if the additional dependency (jang-tools) and custom metal kernel justify a dedicated engine integration, when the quality delta over standard quants is marginal at best.

I want to be clear that this is not me saying "use oQ instead." As i mentioned in a previous issue, i don't think it's a good thing for the MLX ecosystem to have platform-specific quants that only work on a specific inference server. My position on this hasn't changed. Unsloth and GGUF iQ quant are both platform-agnostic quants that work without requiring any changes to the underlying platform. I believe MLX should go the same direction.

Under that premise, if JANG quantization were supported upstream in mlx-lm itself, i would have zero issues with it. The concern is about adding a platform-locked dependency into oMLX specifically.


On the code side, i did spot a few things that might need some adjustments (type annotations, engine dispatch logic, some unused methods, etc). But let's settle the discussion above first. Once you share your thoughts on the platform dependency question, i'll go over the code details with you.

Code review notes (for reference)

Bugs

  • EngineType literal in model_discovery.py and EngineEntry in engine_pool.py don't include "jang". Setting engine_type = "jang" breaks the type system.
  • The VLM branch in _load_engine() was changed from effective_type == "vlm" to entry.engine_type == "vlm". This bypasses the force_lm mechanism that allows VLM models to fall back to LLM loading.
  • server.py:1040 will raise IndexError if model_dirs is an empty list (model_dirs[0]).

Dead code (~25% of jang.py)

  • _load_jang_vlm_manual() (185 lines), _is_vlm_model() (55 lines), _patch_auto_image_processor() (35 lines) are all defined but never called. The start() method comments even note the VLM path "can corrupt weights."

Unrelated changes bundled in

  • oq_manager addition to ServerState and the init_server() parameter cleanup are separate fixes that shouldn't be in a JANG PR.

detect_model_type() ordering change affects all models

  • The new preprocessor_config.json check runs for every model, not just JANG ones. If any embedding or reranker model has a preprocessor_config.json, it will now incorrectly be classified as VLM. The vision_config check is also duplicated 3 times.
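
One way to address that ordering concern (an illustrative sketch only, not oMLX's actual detect_model_type(); function signature and architecture-name heuristics are assumptions) is to run the embedding/reranker checks first, gate the preprocessor_config.json probe on the model being JANG, and do the vision check exactly once:

```python
from pathlib import Path
import tempfile

def detect_model_type(model_dir: str, config: dict) -> str:
    """Order-sensitive model-type detection sketch.

    Embedding/reranker models are classified before any vision probing,
    so a stray preprocessor_config.json cannot flip them to VLM, and the
    preprocessor check only applies to JANG models.
    """
    root = Path(model_dir)
    arch = (config.get("architectures") or [""])[0].lower()
    if "embedding" in arch or "reranker" in arch:
        return "embedding"
    is_jang = (root / "jang_config.json").is_file()
    has_vision = "vision_config" in config or (
        is_jang and (root / "preprocessor_config.json").is_file()
    )
    return "vlm" if has_vision else "llm"

# a non-JANG model with a preprocessor_config.json stays an LLM
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "preprocessor_config.json").write_text("{}")
    print(detect_model_type(d, {"architectures": ["SomeLM"]}))  # llm
```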

Testing gaps

  • No tests for core logic: _infer_bits_and_group_size(), _fix_jang_quantized_bits(), _fix_nemotron_h_weights(), or any generation flow. The main test file (test_jang_vlm.py) primarily tests _is_vlm_model() which is dead code.


AlexTzk commented Mar 28, 2026

@jundot - thanks for coming back to us.

The results I posted were from the builds/artifacts I had at the time, not intended as a final verdict against every latest oQ revision. I agree the benchmark should be rerun on the current versions.

Before I do that, could you share the exact download link for the Nemotron-Cascade-2-30B-A3B Cascade model you tested, and/or the exact FP MLX baseline you used? I can’t find the full-precision MLX artifact on my side, and I want to make sure I’m reproducing your setup exactly.

Once I have the same artifacts, I’m happy to rerun and post the raw JSON outputs.

Alternatively, I can run your oQ4 if you upload it on HF.

Regarding the code, I acknowledge it's not in perfect shape; I was at the point where I wasn't sure if it was still worth my time to rectify it. We'll cross that bridge when we get there.

I do appreciate your efforts and dedication to this platform, it's my favourite Apple LLM server hence why I am keen on contributing.


wsantos commented Mar 28, 2026

@jundot I'd like to reproduce it if you provide either the models or instructions on how to reproduce. I'm not sure what oQ4e means; is that the same as oQ+?


jundot commented Mar 28, 2026

@AlexTzk @wsantos

The original (unquantized) model i used for testing is downloaded directly from this repo (58.8GB):
https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B

Note that loading an original model through mlx-lm can use up to 2x memory, so keep that in mind.

I also uploaded the re-quantized models using the latest version:

(you can get the same results by quantizing with Enhanced mode yourself using 0.2.24 or 0.2.23)

About Q4e — it's a renamed version of Q4+. Turns out HuggingFace doesn't allow special characters in repo names, so i couldn't upload it as Q4+. The "e" stands for "enhanced". My mistake for the confusion there.

Always happy to answer any quantization questions, whether it's about oQ or anything else. Everyone wants to run high-performance models on limited memory, so i appreciate you taking the time to look into this!

@AlexTzk — you calling oMLX your favourite Apple LLM server genuinely made my day as the creator. I also know this PR took a lot of effort to put together. Always grateful for your contributions.


wsantos commented Mar 28, 2026


If you have some spare resources, could you please do oQ4e, oQ3e, oQ2e? I'm trying to fit a good model in my "limited RAM" and I did some benchmarks with this version on JANG too.

You've already done it; I'm downloading and testing them, ty~

jundot and others added 28 commits March 29, 2026 16:35
Thread mlx-lm's XTC (eXclude Top Choices) sampling parameters
through the full request pipeline. XTC was the only mlx-lm
sampler missing from the omlx API surface.

- Add xtc_probability and xtc_threshold fields to SamplingParams
  dataclass (default 0.0 and 0.1 respectively)
- Default xtc_threshold to 0.1 instead of upstream's 0.0 to
  prevent destructive sampling when only probability is set
  (upstream threshold=0.0 excludes all tokens except the least
  probable one)
- Add optional xtc_probability and xtc_threshold to both
  ChatCompletionRequest and CompletionRequest API models
- Extend get_sampling_params() to resolve XTC values with the
  same request > default priority as other sampling params
- Thread XTC params through chat_kwargs dicts and direct engine
  calls across all API endpoints (chat, completion, anthropic
  messages, responses)
- Extract XTC params from kwargs in BatchedEngine and
  VLMBatchedEngine SamplingParams construction
- Pass xtc_probability, xtc_threshold, and xtc_special_tokens
  to both make_sampler() call sites in the scheduler
- Add _get_xtc_special_tokens() helper to Scheduler, delegating
  to _get_stop_tokens() for EOS coverage and caching the result
  at init time
- Add 10 new tests covering defaults, passthrough, API model
  acceptance, and special token derivation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>
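
The destructive threshold=0.0 behaviour that commit guards against is easy to see in a toy XTC filter (a from-scratch sketch of the sampler's exclusion rule, not mlx-lm's implementation):

```python
def xtc_filter(probs, threshold):
    """Toy XTC step: drop every token at or above the threshold except the
    least probable of them, then renormalize. Sketch for illustration only.
    """
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    qualifying = [i for i in order if probs[i] >= threshold]
    if len(qualifying) < 2:
        return list(probs)          # nothing to exclude
    excluded = set(qualifying[:-1])  # keep only the least probable qualifier
    kept = [0.0 if i in excluded else p for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.5, 0.3, 0.15, 0.05]
# sane threshold: the top choices are excluded, the tail survives
print([round(p, 2) for p in xtc_filter(probs, 0.1)])  # [0.0, 0.0, 0.75, 0.25]
# threshold 0.0: every token qualifies, so ONLY the least probable one remains
print([round(p, 2) for p in xtc_filter(probs, 0.0)])  # [0.0, 0.0, 0.0, 1.0]
```

This is why defaulting xtc_threshold to 0.1 instead of 0.0 matters when a caller sets only xtc_probability.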
- _extract_text_from_content_list(): Enhanced to handle edge cases
- extract_text_content(): Add ContentPart list handling for tool messages
  and final safety check to ensure all content is string type
- extract_multimodal_content(): Add ContentPart list handling for tool messages
- extract_harmony_messages(): Add ContentPart list handling for tool and
  assistant messages

Fixes ValueError when messages with content arrays are sent to MLX models.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Why:
Qwen3-VL embedding models can be loaded through mlx-embeddings, but oMLX always used the generic processor(texts, ...) path for embedding requests. For custom processors such as qwen3_vl, that positional call is interpreted as image input, which breaks /v1/embeddings even when the model is explicitly treated as an embedding model.

What:
Detect processors that expose custom embedding input hooks and route embedding requests through prepare_embedding_inputs/prepare_model_inputs instead of the generic tokenizer path. Keep the existing path for standard text processors, and add regression coverage for both compiled and eager execution.
Why:
- add structured multimodal embedding inputs without breaking the existing text input path
- support custom embedding processors that need image-aware input preparation
- keep native embedding loading safe by accepting extra HF weights while rejecting missing or shape-incompatible core weights

What:
- add an items-based embedding request format for text and image inputs
- route structured items through embedding normalization, engine, and custom processor preparation
- count usage from prepared multimodal inputs and preserve empty-string text items
- extend tests for multimodal requests, custom processors, and native loading validation
…piled fallback

- remove unused is_likely_local_image_path() and its import os
- reset total_tokens to None when compiled embed path fails, so the eager fallback recomputes instead of using a stale value
- narrow bare except in _count_prepared_tokens() to (TypeError, ValueError)
eliminates mxfp8 mode and gs=32 special case for 8-bit tensors.
all quantized layers now use affine mode with gs=64, reducing
Metal kernel combo count from 7 to ~5 for oQ4 MoE models.
* fix: update menubar content in real time while menu is open

Two root causes were fixed:

1. NSTimer was registered under NSDefaultRunLoopMode only, which is
   suspended while the status-bar menu is open (macOS enters
   NSEventTrackingRunLoopMode). Changed to NSRunLoopCommonModes so
   healthCheck_ fires even during menu interaction.

2. _build_menu() replaces the entire NSMenu object via setMenu_(), which
   would close a currently-visible menu. Instead, when the menu is open
   we now call _refresh_menu_in_place(), which mutates the existing
   NSMenuItem objects in place (status header attributed title, server
   control button visibility via setHidden_, Admin Panel / Chat enabled
   state).

Additional changes:
- Adopt NSMenuDelegate; menuWillOpen_ always refreshes items before the
  menu is shown, so reopening also reflects the latest state instantly.
- menuDidClose_ clears the _menu_is_open flag so _build_menu() is used
  for full rebuilds (stats submenu, etc.) when the menu is closed.
- Server control section now always adds all three items (Stop / Force
  Restart / Start) and uses setHidden_ to show the relevant one, enabling
  in-place toggling without menu replacement.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review feedback on menubar real-time update PR

Four issues raised in jundot#426:

1. Skip _fetch_stats() when menu is open to avoid blocking the main
   thread. _fetch_stats() makes up to 3 synchronous HTTP requests with
   2s timeouts each, which could stall the UI for ~6s during menu
   event tracking. Stats are fetched on the next healthCheck_ cycle
   after the menu closes.

2. Add ServerStatus.STOPPING to the Stop Server button visibility
   condition in both _build_menu() and _refresh_menu_in_place(). The
   button was hidden during the STOPPING transition, leaving no control
   visible to the user.

3. Restore original button order: Force Restart appears before Stop
   Server (Force Restart is the primary action when UNRESPONSIVE).
   The previous commit had them in the wrong order.

4. Sync icon template state in _refresh_menu_in_place() by calling
   setTemplate_(True) on the Admin Panel and Chat icons after updating
   their enabled state, keeping icon rendering consistent with
   _build_menu().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: EmotionalAmo <emotionalamo@EmotionalAmos-MacBook-Pro.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
- add oq_manager/hf_uploader fields to ServerState dataclass
- update KVCache reconstruct test to expect tensor shape offset
- update oQ predicate bits test for affine-only mode
- rewrite metal limit tests to match no-op behavior (jundot#429)
- fix memory fallback test mock to patch HAS_MLX
mlx-vlm v0.4.2 custom processors may have apply_chat_template method
but no chat_template set, raising ValueError instead of TypeError.
fall back to processor.tokenizer which holds the actual template.
- Create JANGLoader class in omlx/engine/jang.py for JANG model loading
- Integrate JANGLoader with EnginePool via engine_type == "jang" branch
- Support Nemotron-H weight renaming (up_proj->fc1, down_proj->fc2)
- Auto-switch to bfloat16 for large expert models (512+ experts)
- VLM support via load_jang_vlm_model
- jang-tools dependency check with clear error message

AlexTzk commented Mar 29, 2026

@wsantos - I benchmarked the nemotron cascade 2 oQe with:

  • MMLU — 1000 questions
  • Truthful QA — 817 questions
  • HumanEval — 164 questions
  • LiveCodeBench — 100 tasks
  • MBPP — 200 tasks

Very impressive result with Qwen3.5!

I have fixed the bugs and brought the implementation up to date.


wsantos commented Mar 30, 2026


TY, I'll play with it this week
