
feat: JANG implementation #364

Open
AlexTzk wants to merge 83 commits into jundot:main from AlexTzk:main

Conversation


AlexTzk commented Mar 24, 2026


Summary

This PR adds support for JANG quantized models to oMLX. JANG is an Apple Silicon MLX-based architecture that improves model quality when using low-bit quantization (e.g., 2-bit) by applying mixed-precision quantization — preserving critical attention layers at higher precision (6–8 bit) while quantizing expert MLP layers to 2-bit.

Background

On MoE models, attention is only 1–5% of parameters but controls 100% of coherence. Standard MLX quantization compresses everything equally, which breaks low-bit models. JANG solves this by:

  • Attention: 6–8 bit (preserves coherence)
  • Expert MLP: 2–4 bit (95%+ of params, can absorb errors)
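
As a rough illustration of that split (the helper name and path patterns below are hypothetical, not code from this PR), the per-layer bit-width rule boils down to a predicate over parameter paths:

```python
def choose_bits(param_path: str) -> int:
    """Pick a quantization bit-width for a parameter based on its role.

    Hypothetical sketch of the JANG mixed-precision idea: attention
    weights keep high precision, expert MLP weights drop to 2-bit.
    """
    attention_markers = ("self_attn", "q_proj", "k_proj", "v_proj", "o_proj")
    expert_markers = ("switch_mlp", "experts", "mlp")

    if any(m in param_path for m in attention_markers):
        return 8   # preserve coherence-critical attention layers
    if any(m in param_path for m in expert_markers):
        return 2   # expert MLPs make up most params and absorb errors
    return 6       # embeddings, norms, everything else: middle ground


# attention stays at 8-bit, experts go to 2-bit
print(choose_bits("model.layers.0.self_attn.q_proj.weight"))        # 8
print(choose_bits("model.layers.0.mlp.switch_mlp.up_proj.weight"))  # 2
```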

Why

JANG quantization is particularly useful for running modern large language models on Apple Silicon hardware. Without it, MoE models with 256+ experts either crash or produce NaNs below 4-bit quantization because their attention layers get compressed too aggressively. JANG enables models like Qwen3.5-397B (512 experts) and Nemotron-3-Super-120B to run efficiently on M-series Macs at ~2-3 bits per parameter, making high-capacity reasoning models accessible without enterprise-grade GPU infrastructure.
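
As a back-of-envelope sketch (all fractions and overhead figures here are illustrative assumptions, not measurements from this PR), the effective storage cost of a mixed-precision split is just a weighted average of the per-layer bit-widths:

```python
def effective_bits(attn_frac, attn_bits, expert_bits, overhead_bits=0.25):
    """Weighted average bit-width for a mixed-precision model.

    overhead_bits approximates the per-weight cost of group scales
    (e.g. one 16-bit scale per group of 64 weights); 0.25 is an
    assumed round number, purely for illustration.
    """
    return attn_frac * attn_bits + (1 - attn_frac) * expert_bits + overhead_bits


# attention ~5% of params at 8-bit, experts ~95% at 2-bit
bits = effective_bits(0.05, 8, 2)
print(round(bits, 2))  # 2.55

# rough weight memory for a 397B-parameter model, in GB
print(round(397e9 * bits / 8 / 1e9))  # 127
```

At ~2.5 effective bits, a 512-expert 397B model lands near 127 GB of weights, which is why such models become feasible on high-memory M-series Macs at all.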

Changes

File Changes
omlx/engine/jang.py Created JANGLoader class — full JANG model loader with Nemotron-H support, bfloat16 for large models, VLM support
omlx/model_discovery.py Added JANG model detection via jang_config.json
omlx/exceptions.py Added JANGLoadError and JANGDependencyError exceptions
omlx/engine_pool.py Added engine_type == "jang" integration
omlx/server.py Added oq_manager to ServerState
pyproject.toml Added jang[mlx]>=0.1.0 dependency
packaging/venvstacks.toml Added jang-tools dependency
omlx/admin/routes.py Minor import fix
Tests Added comprehensive JANG tests (test_jang_vlm.py, test_model_discovery.py)
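
The jang_config.json detection added to model_discovery.py presumably boils down to a directory check along these lines (function name and return values are illustrative, not the PR's actual code):

```python
from pathlib import Path
import tempfile

def detect_engine_type(model_dir: str) -> str:
    """Return the engine type for a model directory.

    Sketch of the discovery rule this PR describes: a jang_config.json
    marks a JANG model; otherwise fall back to the standard MLX engine.
    """
    root = Path(model_dir)
    if (root / "jang_config.json").is_file():
        return "jang"
    if (root / "config.json").is_file():
        return "mlx"
    raise ValueError(f"not a model folder: {model_dir}")

# demo with a throwaway directory
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "jang_config.json").write_text("{}")
    print(detect_engine_type(d))  # jang
```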

Files Changed

File Lines
.gitignore +1 (JANG config)
omlx/admin/routes.py 5
omlx/engine/jang.py 675 (new file)
omlx/engine_pool.py 10
omlx/model_discovery.py 63
omlx/server.py 1
packaging/venvstacks.toml 2
pyproject.toml 2
tests/integration/test_e2e_streaming.py 5
tests/test_admin_auth.py 7
tests/test_jang_vlm.py 248 (new test file)
tests/test_model_discovery.py 121

Total: 12 files changed, 1,130 insertions(+), 10 deletions(−)

Testing

  • tests/test_jang_vlm.py — JANG VLM model loading tests
  • tests/test_model_discovery.py — JANG model detection tests
  • All streaming tests pass

AlexTzk mentioned this pull request Mar 24, 2026

wsantos commented Mar 24, 2026

@AlexTzk I think you have to rebase, I'm seeing a lot of fixes for 19->20 version here?


wsantos commented Mar 24, 2026

@AlexTzk I'm doing some tests and it looks good so far. I'm impressed with the speed: 53-60 tok/s with a 32k context window

image

But I noticed that the cache says 0.0%; not sure if it's not collecting metrics or not using the cache at all
benchmark:

image

I'm going to run some tests with opencode to check pp/s

  • We might need a new parser? I'm seeing some <dcp-message-id>m0003</dcp-message-id> on the thinking stage
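
If a parser tweak is needed, stripping those markers could be as simple as a regex pass over the streamed thinking text (the tag name is taken from the observed output above; everything else is a guess at what such a filter would look like):

```python
import re

# matches the <dcp-message-id>...</dcp-message-id> spans seen in the thinking stage
_DCP_TAG = re.compile(r"<dcp-message-id>.*?</dcp-message-id>\s*")

def strip_dcp_tags(text: str) -> str:
    """Remove internal message-id markers from model output."""
    return _DCP_TAG.sub("", text)

print(strip_dcp_tags("<dcp-message-id>m0003</dcp-message-id>Let me think..."))
# Let me think...
```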

@Fail-Safe

++ @jjang-ai


AlexTzk commented Mar 24, 2026

@wsantos Thank you kindly for taking the time to run my code and provide feedback!
I will test the cache on my end as well with the JANG models. At the moment I have queued a bunch of benchmarks, I am quantifying the benefit of the JANG architecture vs MLX vs oQ (oMLX).

On Qwen3.5-35b MLX 4 bit vs JANG 4k the performance increase is noticeable:

image

But on Minimax 2.5 MLX 3bit vs JANG 2L, the story is not as consistent:

image

The Minimax benchmark is one of the best results the JANG architecture claims to bring (according to the website), which is why I wanted to validate the result for myself.

image

Benchmarks:

  • MMLU — 1000 questions
  • Truthful QA — 817 questions
  • HumanEval — 164 questions
  • LiveCodeBench — 100 tasks
  • MBPP — 200 tasks

I will continue my research but may have to stop performing the full tests as this takes too much time.

AlexTzk changed the title from "JANG implementation" to "feat: JANG implementation" Mar 24, 2026

AlexTzk commented Mar 25, 2026

@wsantos I have rebased the branch and fixed all the conflicts. I also fixed a bug in jang.py that would fail to load Nemotron models. The implementation is complete and ready for merging.


wsantos commented Mar 25, 2026

@jundot I saw on the other issue that you don't want this kind of implementation, but this makes big models usable on "low-end" machines. For example, I cannot run 35B-A3B0-4bit on my 32GB machine and work at the same time; with this PR it's possible. Maybe we could have a plugin system instead, so this could be implemented outside the core and be easy to integrate? Let me know how you want to proceed and I can keep testing the PR for you if you want.

@0xClandestine

wen merge


AlexTzk commented Mar 26, 2026

More results from the JANG models. I must say, very impressive results from the Super model but even the Cascade model still outperforms the oQ variant.

These benchmarks are done on:

  • MMLU — 1000 questions
  • Truthful QA — 817 questions
  • HumanEval — 164 questions
  • LiveCodeBench — 100 tasks
  • MBPP — 200 tasks

I wanted to test and make sure the performance increase is there and it is. The community would benefit from this architecture. It outperforms the 8-bit mlx NANO variant too. The nemotron variants are also incredibly fast.

image

If anyone wants the actual JSON benchmark outputs I can post them somewhere.

AlexTzk added 5 commits March 26, 2026 13:24
- New JANGLoader class in omlx/engine/jang.py
- Integrates with jang-tools package for loading JANG quantized models
- Handles Nemotron-H weight renaming (up_proj -> fc1/fc2)
- Auto-switches to bfloat16 for large expert models (512+ experts)
- VLM model support with load_jang_vlm_model
- Implements all BaseEngine abstract methods
- Create JANGLoader class in omlx/engine/jang.py for JANG model loading
- Integrate JANGLoader with EnginePool via engine_type == "jang" branch
- Support Nemotron-H weight renaming (up_proj->fc1, down_proj->fc2)
- Auto-switch to bfloat16 for large expert models (512+ experts)
- VLM support via load_jang_vlm_model
- jang-tools dependency check with clear error message
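
The Nemotron-H renaming the commits describe can be sketched as a key rewrite over the loaded weight dict (the up_proj -> fc1 / down_proj -> fc2 mapping comes from the commit messages; the helper itself is an assumed shape, not the PR's code):

```python
# up_proj -> fc1, down_proj -> fc2, per the commit messages above
_RENAMES = {"up_proj": "fc1", "down_proj": "fc2"}

def rename_nemotron_h_weights(weights: dict) -> dict:
    """Rewrite weight keys so Nemotron-H checkpoints match the MLX module tree."""
    out = {}
    for key, value in weights.items():
        parts = key.split(".")
        out[".".join(_RENAMES.get(p, p) for p in parts)] = value
    return out

w = {"model.layers.0.mlp.up_proj.weight": 1,
     "model.layers.0.mlp.down_proj.weight": 2}
print(sorted(rename_nemotron_h_weights(w)))
# ['model.layers.0.mlp.fc1.weight', 'model.layers.0.mlp.fc2.weight']
```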

AlexTzk commented Mar 26, 2026

The PR has been updated - yet again - and is ready for merging.


wsantos commented Mar 27, 2026

@AlexTzk It's not working; I did another test and got:

odel.language_model.layers.9.mlp.switch_mlp.up_proj.weight.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 410, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/applications.py", line 1160, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/middleware/cors.py", line 95, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/middleware/cors.py", line 153, in simple_response
    await self.app(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/server.py", line 517, in __call__
    await self.app(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 130, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 116, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 670, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 324, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/server.py", line 1799, in create_chat_completion
    engine = await get_engine_for_model(request.model)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/server.py", line 673, in get_engine_for_model
    return await get_engine(model, EngineType.LLM)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/server.py", line 598, in get_engine
    engine = await pool.get_engine(model_id)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/engine_pool.py", line 398, in get_engine
    await self._load_engine(model_id, force_lm=force_lm)
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/engine_pool.py", line 575, in _load_engine
    await engine.start()
  File "/opt/homebrew/Cellar/python@3.11/3.11.15/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/omlx/omlx/engine/batched.py", line 146, in _load_model_sync
    return load(
           ^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/mlx_lm/utils.py", line 491, in load
    model, config = load_model(model_path, lazy, model_config=model_config)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/mlx_lm/utils.py", line 415, in load_model
    model.load_weights(list(weights.items()), strict=strict)
  File "/Users/waldecirsantos/projetos/models/.venv/lib/python3.11/site-packages/mlx/nn/layers/base.py", line 191, in load_weights
    raise ValueError(f"Missing {num_missing} parameters: \n{missing}.")
ValueError: Missing 660 parameters:
language_model.lm_head.weight,
language_model.model.embed_tokens.weight,
language_model.model.layers.0.input_layernorm.weight,


AlexTzk commented Mar 27, 2026

@wsantos which model are you loading? Please also make sure you are up to date with my branch; there were a couple of bugs I fixed in the last commits.


wsantos commented Mar 27, 2026

@wsantos which model are you loading? Please also make sure you are up to date with my branch, there were a couple of bugs i fixed in the last commits.

Hot cache: 1.0GB (in-memory)
2026-03-26 21:25:53,427 - omlx.server - INFO - CORS origins: ['*']
2026-03-26 21:25:53,427 - omlx.model_settings - INFO - Loaded settings for 18 models
2026-03-26 21:25:53,428 - omlx.model_discovery - INFO - Discovered model: Nemotron-Cascade-2-30B-A3B-JANG_2L (type: llm, engine: jang, size: 10.79GB)
2026-03-26 21:25:53,428 - omlx.model_discovery - INFO - Discovered model: Qwen3.5-27B-JANG_4S (type: vlm, engine: jang, size: 16.68GB)
2026-03-26 21:25:53,428 - omlx.model_discovery - INFO - Discovered model: Qwen3.5-35B-A3B-JANG_2S (type: vlm, engine: jang, size: 5.24GB)
2026-03-26 21:25:53,429 - omlx.model_discovery - INFO - Discovered model: Qwen3.5-35B-A3B-JANG_4K (type: vlm, engine: jang, size: 10.44GB)
2026-03-26 21:25:53,429 - omlx.model_discovery - DEBUG - Skipping models--mlx-community--Qwen3.5-27B-4bit: no config.json found (not a model or organization folder)
2026-03-26 21:25:53,429 - omlx.model_discovery - DEBUG - Skipping models--mlx-community--Qwen3.5-35B-A3B-4bit: no config.json found (not a model or organization folder)
2026-03-26 21:25:53,429 - omlx.engine_pool - WARNING - Pinned model not found: Qwen3.5-35B-A3B-3bit
2026-03-26 21:25:53,429 - omlx.engine_pool - INFO - Discovered 4 models, max memory: 23.04GB
2026-03-26 21:25:53,429 - omlx.server - WARNING - Default model 'Qwen3.5-35B-A3B-4bit' not found, using first model
2026-03-26 21:25:53,429 - omlx.server_metrics - INFO - Loaded all-time stats from /Users/waldecirsantos/.omlx/stats.json
2026-03-26 21:25:53,430 - omlx.server - INFO - Server initialized with 4 models
2026-03-26 21:25:53,430 - omlx.server - INFO - Default model: Nemotron-Cascade-2-30B-A3B-JANG_2L
2026-03-26 21:25:53,430 - omlx.server - INFO - Max model memory: 23.04GB
2026-03-26 21:25:53,430 - omlx.server - INFO - Default max tokens: 32768
2026-03-26 21:25:53,430 - omlx.server - INFO - API key authentication: enabled
2026-03-26 21:25:53,430 - omlx.server - INFO - HF Downloader initialized
2026-03-26 21:25:53,481 - omlx.server - INFO - ModelScope SDK not installed, MS downloader disabled
2026-03-26 21:25:53,482 - omlx.server - INFO - oQ Quantizer initialized
2026-03-26 21:25:53,482 - omlx.server - INFO - HF Uploader initialized
Starting server at http://127.0.0.1:8000
2026-03-26 21:25:53,483 - asyncio - DEBUG - Using selector: KqueueSelector
INFO:     Started server process [51773]
INFO:     Waiting for application startup.
2026-03-26 21:25:53,486 - omlx.process_memory_enforcer - INFO - Metal memory limit set: 28.0GB, cache limit: 14.0GB
2026-03-26 21:25:53,486 - omlx.process_memory_enforcer - INFO - Process memory enforcer started (limit: 25.6GB, interval: 1.0s)
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
2026-03-26 21:26:07,604 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): api.github.com:443
2026-03-26 21:26:07,838 - urllib3.connectionpool - DEBUG - https://api.github.com:443 "GET /repos/jundot/omlx/releases/latest HTTP/1.1" 200 1554
2026-03-26 21:26:19,827 - omlx.model_settings - INFO - Updated settings for model 'Qwen3.5-35B-A3B-JANG_4K'
2026-03-26 21:26:19,829 - omlx.model_settings - DEBUG - Saved settings for 19 models
2026-03-26 21:26:23,091 - omlx.model_settings - INFO - Updated settings for model 'Qwen3.5-35B-A3B-JANG_4K'
2026-03-26 21:26:23,093 - omlx.model_settings - DEBUG - Saved settings for 19 models
2026-03-26 21:26:37,360 - omlx.server - DEBUG - Chat completion request received: model=Qwen3.5-35B-A3B-JANG_4K, messages=1, stream=True, max_tokens=None, temp=None
2026-03-26 21:26:37,361 - omlx.engine_pool - INFO - Loading model: Qwen3.5-35B-A3B-JANG_4K
2026-03-26 21:26:37,680 - torchao - DEBUG - Skipping import of cpp extensions: operator torchao::_linear_8bit_act_1bit_weight does not exist
W0326 21:26:37.744000 51773 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
2026-03-26 21:26:37,773 - torchao.kernel.intmm - WARNING - Warning: Detected no triton, on systems without Triton certain kernels will not work
2026-03-26 21:26:38,011 - omlx.engine.vlm - DEBUG - Removed video_processor from MODALITY_TO_AUTOPROCESSOR_MAPPING
2026-03-26 21:26:38,054 - omlx.engine_pool - WARNING - VLM loading failed for Qwen3.5-35B-A3B-JANG_4K, falling back to LLM: Received 219 parameters not in model:
model.language_model.layers.0.mlp.switch_mlp.gate_proj.biases,
model.language_model.layers.0.mlp.switch_mlp.gate_proj.scales,
model.language_model.layers.0.mlp.switch_mlp.gate_proj.weight,

both 35B, I'll clean up everything including cache and try again.

Introduce the jang-tools engine supporting mixed-precision quantization
(attention 6-8-bit, experts 2-4-bit) with VLM capabilities. Includes
model discovery updates, engine pool integration, and comprehensive tests.

AlexTzk commented Mar 27, 2026

@wsantos you were totally right, there was a bug with Qwen3.5 35B. I fixed it now, rebased on the latest release and fixed another bug with mistral.

@jundot - are you planning to integrate this work? It seems that a few people, including myself, see the benefit of this architecture. Whilst I do agree this quant implementation would be better at the mlx level, sometimes you have to let the community make their own choices. If there is anything I can address, kindly let me know, because playing cat & mouse here is not a luxury I can afford daily.


jundot commented Mar 28, 2026

@AlexTzk sorry for the late response on this. It's not that i was ignoring this specific PR or anything like that. I just haven't been able to review any PRs during weekdays at all. As i mentioned in this discussion, oMLX is a personal project i work on outside of my main job, and my availability is pretty limited. I hope you understand.


I pulled your branch yesterday (before the last bug patch) and tried loading a JANG model to test things out. It failed to load with errors at that point. Looks like the loading issue itself is resolved now after your latest fixes.

However, the benchmark numbers you posted seem to differ quite a bit from what i actually measured. I'm guessing those were tested against a much earlier version of oQ. Here are my actual results on Nemotron-Cascade-2-30B-A3B:

Model Size MMLU WINOGRANDE HUMANEVAL MBPP
Original (unquantized) 58.8 GB 68.1% (681/1000) 60.0% (760/1267) 81.1% (133/164) 68.3% (205/300)
JANG_4M 17.0 GB 67.7% (677/1000) 58.9% (746/1267) 79.9% (131/164) 65.0% (195/300)
oQ4e 17.3 GB 68.0% (680/1000) 59.4% (752/1267) 81.7% (134/164) 68.0% (204/300)
JANG_2L 10.3 GB 61.3% (613/1000) 51.9% (658/1267) 75.0% (123/164) 60.3% (181/300)
oQ2e 10.4 GB 59.8% (598/1000) 52.8% (669/1267) 75.0% (123/164) 59.3% (178/300)

Could you re-run your benchmarks against the current versions and update the numbers? The results i'm seeing don't match what was posted earlier.


Looking at these benchmark results honestly, i'm not sure if the additional dependency (jang-tools) and custom metal kernel justify a dedicated engine integration, when the quality delta over standard quants is marginal at best.

I want to be clear that this is not me saying "use oQ instead." As i mentioned in a previous issue, i don't think it's a good thing for the MLX ecosystem to have platform-specific quants that only work on a specific inference server. My position on this hasn't changed. Unsloth and GGUF iQ quant are both platform-agnostic quants that work without requiring any changes to the underlying platform. I believe MLX should go the same direction.

Under that premise, if JANG quantization were supported upstream in mlx-lm itself, i would have zero issues with it. The concern is about adding a platform-locked dependency into oMLX specifically.


On the code side, i did spot a few things that might need some adjustments (type annotations, engine dispatch logic, some unused methods, etc). But let's settle the discussion above first. Once you share your thoughts on the platform dependency question, i'll go over the code details with you.

Code review notes (for reference)

Bugs

  • EngineType literal in model_discovery.py and EngineEntry in engine_pool.py don't include "jang". Setting engine_type = "jang" breaks the type system.
  • The VLM branch in _load_engine() was changed from effective_type == "vlm" to entry.engine_type == "vlm". This bypasses the force_lm mechanism that allows VLM models to fall back to LLM loading.
  • server.py:1040 will raise IndexError if model_dirs is an empty list (model_dirs[0]).

Dead code (~25% of jang.py)

  • _load_jang_vlm_manual() (185 lines), _is_vlm_model() (55 lines), _patch_auto_image_processor() (35 lines) are all defined but never called. The start() method comments even note the VLM path "can corrupt weights."

Unrelated changes bundled in

  • oq_manager addition to ServerState and the init_server() parameter cleanup are separate fixes that shouldn't be in a JANG PR.

detect_model_type() ordering change affects all models

  • The new preprocessor_config.json check runs for every model, not just JANG ones. If any embedding or reranker model has a preprocessor_config.json, it will now incorrectly be classified as VLM. The vision_config check is also duplicated 3 times.
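
One way to address that ordering concern (an illustrative sketch only, not oMLX's actual detect_model_type(); function signature and architecture-name heuristics are assumptions) is to run the embedding/reranker checks first, gate the preprocessor_config.json probe on the model being JANG, and do the vision check exactly once:

```python
from pathlib import Path
import tempfile

def detect_model_type(model_dir: str, config: dict) -> str:
    """Order-sensitive model-type detection sketch.

    Embedding/reranker models are classified before any vision probing,
    so a stray preprocessor_config.json cannot flip them to VLM, and the
    preprocessor check only applies to JANG models.
    """
    root = Path(model_dir)
    arch = (config.get("architectures") or [""])[0].lower()
    if "embedding" in arch or "reranker" in arch:
        return "embedding"
    is_jang = (root / "jang_config.json").is_file()
    has_vision = "vision_config" in config or (
        is_jang and (root / "preprocessor_config.json").is_file()
    )
    return "vlm" if has_vision else "llm"

# a non-JANG model with a preprocessor_config.json stays an LLM
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "preprocessor_config.json").write_text("{}")
    print(detect_model_type(d, {"architectures": ["SomeLM"]}))  # llm
```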

Testing gaps

  • No tests for core logic: _infer_bits_and_group_size(), _fix_jang_quantized_bits(), _fix_nemotron_h_weights(), or any generation flow. The main test file (test_jang_vlm.py) primarily tests _is_vlm_model() which is dead code.


AlexTzk commented Mar 28, 2026

@jundot - thanks for coming back to us.

The results I posted were from the builds/artifacts I had at the time, not intended as a final verdict against every latest oQ revision. I agree the benchmark should be rerun on the current versions.

Before I do that, could you share the exact download link for the Nemotron-Cascade-2-30B-A3B Cascade model you tested, and/or the exact FP MLX baseline you used? I can’t find the full-precision MLX artifact on my side, and I want to make sure I’m reproducing your setup exactly.

Once I have the same artifacts, I’m happy to rerun and post the raw JSON outputs.

Alternatively, I can run your oQ4 if you upload it on HF.

Regarding the code, I acknowledge it's not in perfect shape; I was at the point where I wasn't sure if it was still worth my time to rectify it. We'll cross that bridge when we get there.

I do appreciate your efforts and dedication to this platform, it's my favourite Apple LLM server hence why I am keen on contributing.


wsantos commented Mar 28, 2026

@jundot I'd like to reproduce it if you provide either the models or instructions on how to reproduce. I'm not sure what oQ4e means; is that the same as oQ+?


jundot commented Mar 28, 2026

@AlexTzk @wsantos

The original (unquantized) model i used for testing is downloaded directly from this repo (58.8GB):
https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B

Note that loading an original model through mlx-lm can use up to 2x memory, so keep that in mind.

I also uploaded the re-quantized models using the latest version:

(you can get the same results by quantizing with Enhanced mode yourself using 0.2.24 or 0.2.23)

About Q4e — it's a renamed version of Q4+. Turns out HuggingFace doesn't allow special characters in repo names, so i couldn't upload it as Q4+. The "e" stands for "enhanced". My mistake for the confusion there.

Always happy to answer any quantization questions, whether it's about oQ or anything else. Everyone wants to run high-performance models on limited memory, so i appreciate you taking the time to look into this!

@AlexTzk — you calling oMLX your favourite Apple LLM server genuinely made my day as the creator. I also know this PR took a lot of effort to put together. Always grateful for your contributions.


wsantos commented Mar 28, 2026


If you have some spare resources, could you please do oQ4e, oQ3e, oQ2e? I'm trying to fit a good model in my "limited RAM" and I did some benchmarks with this version on JANG too.

You've already done it; I'm downloading and testing them, ty~

jundot and others added 28 commits March 29, 2026 16:35
Thread mlx-lm's XTC (eXclude Top Choices) sampling parameters
through the full request pipeline. XTC was the only mlx-lm
sampler missing from the omlx API surface.

- Add xtc_probability and xtc_threshold fields to SamplingParams
  dataclass (default 0.0 and 0.1 respectively)
- Default xtc_threshold to 0.1 instead of upstream's 0.0 to
  prevent destructive sampling when only probability is set
  (upstream threshold=0.0 excludes all tokens except the least
  probable one)
- Add optional xtc_probability and xtc_threshold to both
  ChatCompletionRequest and CompletionRequest API models
- Extend get_sampling_params() to resolve XTC values with the
  same request > default priority as other sampling params
- Thread XTC params through chat_kwargs dicts and direct engine
  calls across all API endpoints (chat, completion, anthropic
  messages, responses)
- Extract XTC params from kwargs in BatchedEngine and
  VLMBatchedEngine SamplingParams construction
- Pass xtc_probability, xtc_threshold, and xtc_special_tokens
  to both make_sampler() call sites in the scheduler
- Add _get_xtc_special_tokens() helper to Scheduler, delegating
  to _get_stop_tokens() for EOS coverage and caching the result
  at init time
- Add 10 new tests covering defaults, passthrough, API model
  acceptance, and special token derivation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Blightbow <blightbow@users.noreply.github.com>
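
The destructive threshold=0.0 behaviour that commit guards against is easy to see in a toy XTC filter (a from-scratch sketch of the sampler's exclusion rule, not mlx-lm's implementation):

```python
def xtc_filter(probs, threshold):
    """Toy XTC step: drop every token at or above the threshold except the
    least probable of them, then renormalize. Sketch for illustration only.
    """
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    qualifying = [i for i in order if probs[i] >= threshold]
    if len(qualifying) < 2:
        return list(probs)          # nothing to exclude
    excluded = set(qualifying[:-1])  # keep only the least probable qualifier
    kept = [0.0 if i in excluded else p for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.5, 0.3, 0.15, 0.05]
# sane threshold: the top choices are excluded, the tail survives
print([round(p, 2) for p in xtc_filter(probs, 0.1)])  # [0.0, 0.0, 0.75, 0.25]
# threshold 0.0: every token qualifies, so ONLY the least probable one remains
print([round(p, 2) for p in xtc_filter(probs, 0.0)])  # [0.0, 0.0, 0.0, 1.0]
```

This is why defaulting xtc_threshold to 0.1 instead of 0.0 matters when a caller sets only xtc_probability.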
- _extract_text_from_content_list(): Enhanced to handle edge cases
- extract_text_content(): Add ContentPart list handling for tool messages
  and final safety check to ensure all content is string type
- extract_multimodal_content(): Add ContentPart list handling for tool messages
- extract_harmony_messages(): Add ContentPart list handling for tool and
  assistant messages

Fixes ValueError when messages with content arrays are sent to MLX models.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Why:
Qwen3-VL embedding models can be loaded through mlx-embeddings, but oMLX always used the generic processor(texts, ...) path for embedding requests. For custom processors such as qwen3_vl, that positional call is interpreted as image input, which breaks /v1/embeddings even when the model is explicitly treated as an embedding model.

What:
Detect processors that expose custom embedding input hooks and route embedding requests through prepare_embedding_inputs/prepare_model_inputs instead of the generic tokenizer path. Keep the existing path for standard text processors, and add regression coverage for both compiled and eager execution.
Why:
- add structured multimodal embedding inputs without breaking the existing text input path
- support custom embedding processors that need image-aware input preparation
- keep native embedding loading safe by accepting extra HF weights while rejecting missing or shape-incompatible core weights

What:
- add an items-based embedding request format for text and image inputs
- route structured items through embedding normalization, engine, and custom processor preparation
- count usage from prepared multimodal inputs and preserve empty-string text items
- extend tests for multimodal requests, custom processors, and native loading validation
…piled fallback

- remove unused is_likely_local_image_path() and its import os
- reset total_tokens to None when compiled embed path fails, so the eager fallback recomputes instead of using a stale value
- narrow bare except in _count_prepared_tokens() to (TypeError, ValueError)
eliminates mxfp8 mode and gs=32 special case for 8-bit tensors.
all quantized layers now use affine mode with gs=64, reducing
Metal kernel combo count from 7 to ~5 for oQ4 MoE models.
* fix: update menubar content in real time while menu is open

Two root causes were fixed:

1. NSTimer was registered under NSDefaultRunLoopMode only, which is
   suspended while the status-bar menu is open (macOS enters
   NSEventTrackingRunLoopMode). Changed to NSRunLoopCommonModes so
   healthCheck_ fires even during menu interaction.

2. _build_menu() replaces the entire NSMenu object via setMenu_(), which
   would close a currently-visible menu. Instead, when the menu is open
   we now call _refresh_menu_in_place(), which mutates the existing
   NSMenuItem objects in place (status header attributed title, server
   control button visibility via setHidden_, Admin Panel / Chat enabled
   state).

Additional changes:
- Adopt NSMenuDelegate; menuWillOpen_ always refreshes items before the
  menu is shown, so reopening also reflects the latest state instantly.
- menuDidClose_ clears the _menu_is_open flag so _build_menu() is used
  for full rebuilds (stats submenu, etc.) when the menu is closed.
- Server control section now always adds all three items (Stop / Force
  Restart / Start) and uses setHidden_ to show the relevant one, enabling
  in-place toggling without menu replacement.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review feedback on menubar real-time update PR

Four issues raised in jundot#426:

1. Skip _fetch_stats() when menu is open to avoid blocking the main
   thread. _fetch_stats() makes up to 3 synchronous HTTP requests with
   2s timeouts each, which could stall the UI for ~6s during menu
   event tracking. Stats are fetched on the next healthCheck_ cycle
   after the menu closes.

2. Add ServerStatus.STOPPING to the Stop Server button visibility
   condition in both _build_menu() and _refresh_menu_in_place(). The
   button was hidden during the STOPPING transition, leaving no control
   visible to the user.

3. Restore original button order: Force Restart appears before Stop
   Server (Force Restart is the primary action when UNRESPONSIVE).
   The previous commit had them in the wrong order.

4. Sync icon template state in _refresh_menu_in_place() by calling
   setTemplate_(True) on the Admin Panel and Chat icons after updating
   their enabled state, keeping icon rendering consistent with
   _build_menu().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: EmotionalAmo <emotionalamo@EmotionalAmos-MacBook-Pro.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
- add oq_manager/hf_uploader fields to ServerState dataclass
- update KVCache reconstruct test to expect tensor shape offset
- update oQ predicate bits test for affine-only mode
- rewrite metal limit tests to match no-op behavior (jundot#429)
- fix memory fallback test mock to patch HAS_MLX
mlx-vlm v0.4.2 custom processors may have apply_chat_template method
but no chat_template set, raising ValueError instead of TypeError.
fall back to processor.tokenizer which holds the actual template.
- Create JANGLoader class in omlx/engine/jang.py for JANG model loading
- Integrate JANGLoader with EnginePool via engine_type == "jang" branch
- Support Nemotron-H weight renaming (up_proj->fc1, down_proj->fc2)
- Auto-switch to bfloat16 for large expert models (512+ experts)
- VLM support via load_jang_vlm_model
- jang-tools dependency check with clear error message

AlexTzk commented Mar 29, 2026

@wsantos - I benchmarked the nemotron cascade 2 oQe with:

  • MMLU — 1000 questions
  • Truthful QA — 817 questions
  • HumanEval — 164 questions
  • LiveCodeBench — 100 tasks
  • MBPP — 200 tasks

Very impressive result with Qwen3.5!

I have fixed the bugs and brought the implementation up to date.


wsantos commented Mar 30, 2026


TY, I'll play with it this week
