Hi, I know you said this isn't the place for ONNX specifically but MNN; still, I'd like to bring this to your attention in case it helps. I noticed it when I had the chance to test the export today.
```bash
llmexport --path ./Qwen3-VL-2B-Instruct \
    --dst_path ./qwen3_onnx \
    --export onnx \
    --quant_bit 4 \
    --lm_quant_bit 8
```
I tried to run a very simple inference with image input. After the prompt `<im_start>assistant\n`, the first token should predict words with probabilities like:
- "The": ~40%
- "This": ~20%
- "A": ~15%
But the actual behavior is different; the first-token predictions are:
- " " (space): 13%
- "," (comma): 11%
- "\n" (newline): 9%
- "a": 3%
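For reference, this is roughly how I read off the first-token distribution from the prefill logits (a minimal sketch on a stand-in numpy array, not the real model output):

```python
import numpy as np

def first_token_probs(logits, top_k=4):
    """Softmax over the last-position logits, return top-k (token_id, prob) pairs."""
    x = logits - logits.max()              # subtract max for numerical stability
    probs = np.exp(x) / np.exp(x).sum()
    top = np.argsort(probs)[::-1][:top_k]  # ids sorted by descending probability
    return [(int(i), float(probs[i])) for i in top]

# stand-in logits for a tiny 6-token vocabulary
logits = np.array([2.0, 1.5, 1.0, 0.2, -1.0, -2.0])
print(first_token_probs(logits, top_k=3))
```

With the real model, `logits` would be the last row of the ONNX `logits` output after prefill; a healthy model puts clearly dominant mass on a sensible first word rather than spreading it thinly over punctuation.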
So the problem is that the model generates repetitive tokens (commas, spaces, newlines) instead of coherent text. I also experimented with the source code to fix this, but haven't succeeded yet.
I've ruled out inference-side issues:
- Visual encoder works correctly (outputs valid embeddings)
- Position IDs constructed correctly for mrope (3D coordinates)
- Deepstack features extracted and applied to correct layers (5, 11, 17)
- KV cache grows properly
- Attention masks correct
- No NaN/Inf values in weights
- ONNX graph structure looks correct (no FakeLinear nodes remain)
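The NaN/Inf check boils down to something like this (a sketch; in practice I iterated over the ONNX initializers, shown here on stand-in arrays):

```python
import numpy as np

def has_bad_values(arr):
    """True if the tensor contains NaN or Inf anywhere."""
    return bool(np.isnan(arr).any() or np.isinf(arr).any())

clean = np.random.randn(4, 4).astype(np.float32)   # stand-in for a weight tensor
broken = clean.copy()
broken[0, 0] = np.nan                              # deliberately corrupted copy
print(has_bad_values(clean), has_bad_values(broken))
```

All exported weights pass this check, which is why I think the values themselves are intact and the bug is structural.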
So the bug probably manifests before generation: the first-token probabilities after prefill are already wrong.
I have some guesses about the cause, but I'm not sure, so treat this as thinking out loud:
- Weight transpose during the FakeLinear→MatMul conversion in onnx_rebuilder.py
- MRoPE export: Qwen3-VL uses 3D rotary position encoding, which may not export correctly
- Deepstack integration: auxiliary visual features injected into specific transformer layers
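On the transpose guess: `nn.Linear` computes `x @ W.T`, so when a FakeLinear is rebuilt into a raw MatMul the stored weight must be transposed for the second input; if it isn't, the outputs are garbage but still finite, which would match the symptoms. A quick numpy sanity check (stand-in shapes, not the actual rebuilder code):

```python
import numpy as np

x = np.random.randn(2, 8)    # [batch, in_features], stand-in shapes
W = np.random.randn(16, 8)   # nn.Linear weight layout: [out_features, in_features]

linear_out = x @ W.T         # what nn.Linear actually computes
B = W.T                      # the initializer a raw MatMul node needs: [in, out]

# If the rebuilder forgets the transpose, MatMul either fails on shapes
# or (for square layers) silently produces wrong-but-finite values.
assert np.allclose(x @ B, linear_out)
```

This is the kind of mismatch that would leave the graph structurally valid while corrupting every projection, so it seems worth ruling out first.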
I also modified the code to apply deepstack to layers 5, 11, 17. In `llmexport/llmexport.py` (forward method, lines 404–405 at commit c742d29) I replaced:

```python
if deepstack_embeds is not None and i in range(deepstack_embeds.shape[0]):
    hidden_states += deepstack_embeds[i]
```
with:

```python
if deepstack_embeds is not None:
    if hasattr(self, 'visual') and self.visual is not None and hasattr(self.visual, 'deepstack_visual_indexes'):
        if i in self.visual.deepstack_visual_indexes:
            idx = self.visual.deepstack_visual_indexes.index(i)
            if idx < deepstack_embeds.shape[0]:
                hidden_states += deepstack_embeds[idx]
```
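To convince myself the mapping was right, I checked it with plain indices (assuming `deepstack_visual_indexes == [5, 11, 17]`; the 28 below is just a stand-in layer count):

```python
# assumed values: deepstack layers [5, 11, 17]; 28 is a stand-in layer count
deepstack_visual_indexes = [5, 11, 17]
applied = {}
for i in range(28):                          # loop over transformer layers
    if i in deepstack_visual_indexes:
        idx = deepstack_visual_indexes.index(i)
        applied[i] = idx                     # layer i gets deepstack_embeds[idx]
print(applied)                               # {5: 0, 11: 1, 17: 2}
```

The original `i in range(deepstack_embeds.shape[0])` condition would instead have added embeddings to layers 0, 1, 2, which is why I was fairly confident this replacement was at least directionally correct.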
I also changed the deepstack shape from `[3, 1, hidden_size]` to `[3, seq_len, hidden_size]` to test it (the original dummy input, lines 826–827 of `llmexport/llmexport.py` at commit c742d29):

```python
# add deepstack_embeds input
deepstack_embeds = torch.randn(3, 1, self.hidden_size)
```
and experimented with the dynamic axes:

```python
qwen3_dynamic_axes = self.model_dynamic_axes.copy()
qwen3_dynamic_axes['deepstack_embeds'] = {1: 'seq_len'}
```
But unfortunately, these changes didn't fix the issue.
Thank you very much for your contributions!