
Add Kimi-K2.5 multimodal contrib (1T MoE + MoonViT vision encoder) #145

Open

jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/kimi-k25-multimodal

Conversation

@jimburtoft (Contributor)

Summary

Community contrib for Kimi-K2.5 (moonshotai/Kimi-K2.5) — a native multimodal agentic model with a 1T MoE text decoder and 400M MoonViT vision encoder — running on trn2.48xlarge via NxD Inference.

This extends the Kimi-K2 text-only contrib (PR #131) with vision encoder fusion for multimodal image-to-text inference.

Configuration

  • Instance: trn2.48xlarge (TP=64, EP=1, LNC=2)
  • Quantization: FP8 per-channel (experts), BF16 (attention, shared experts)
  • Sequence length: 512 (configurable)
  • SDK: Neuron SDK 2.29 (DLAMI 20260410)

Performance

| Metric | Value |
| --- | --- |
| TKG throughput | 46.6 tok/s (21.4 ms TPOT) |
| CTE latency (with vision) | 2.1 s |
| MoonViT encoding | 35.5 ms |
| Model load time | ~50 min (from-scratch sharding) |

Architecture

  • Text decoder: Kimi-K2 MoE (61 layers, 384 routed experts + shared experts, MLA attention) — reused unchanged from PR #131 (Add Kimi-K2-Instruct-0905 contrib model, 1T MoE on trn2.48xlarge)
  • Vision encoder: MoonViT-400M (27-layer ViT, real-number 2D RoPE, eager attention, PatchMergerMLP)
  • Fusion: scatter_by_index_put (same Llama4/Pixtral pattern used in NxDI); see the sketch after this list
  • Checkpoint: K2.5 INT4 compressed-tensors → dequant to BF16 → FP8 per-channel re-quantization
  • Wrapper: K25ImageToTextModelWrapper with non-trivial (ones-like) vision tracing inputs
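For reference, here is the fusion pattern as a minimal PyTorch sketch. The contrib's actual helper is scatter_by_index_put; the function name and signature below are illustrative, not the contrib's API:

```python
import torch

def scatter_vision_embeddings(text_embeds: torch.Tensor,
                              vision_embeds: torch.Tensor,
                              input_ids: torch.Tensor,
                              image_token_id: int) -> torch.Tensor:
    """Overwrite token embeddings at image-placeholder positions with the
    MoonViT outputs (the Llama4/Pixtral-style index_put fusion).

    text_embeds:   [batch, seq_len, hidden] token embeddings
    vision_embeds: [num_image_tokens, hidden], precomputed on CPU
    input_ids:     [batch, seq_len]; image_token_id marks image slots
    """
    positions = (input_ids == image_token_id).nonzero(as_tuple=True)
    return text_embeds.index_put(positions, vision_embeds.to(text_embeds.dtype))
```

The number of placeholder tokens in input_ids must equal num_image_tokens, which is what allows the embeddings to be precomputed offline and scattered in a single call.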

Files

| File | Lines | Description |
| --- | --- | --- |
| src/modeling_kimi_k2.py | 1548 | K2 text decoder (from PR #131) |
| src/modeling_kimi_k25.py | 1041 | K2.5 multimodal: checkpoint loader, vision fusion, config builder |
| src/moonvit.py | 397 | MoonViT vision encoder |
| test/integration/test_model.py | 500 | 5 integration tests |
| README.md | 335 | Documentation with usage, benchmarks, compatibility |

Tests

All 5 integration tests pass on trn2.48xlarge (SDK 2.29):

  1. Smoke test — model loads successfully
  2. Multimodal generation — generates coherent image descriptions
  3. Vision A/B — real vs zero vision embeddings produce different outputs (max_logit_diff=15.2); sketched below
  4. Coherence — output is non-repetitive, descriptive text
  5. TPOT — 21.4 ms/token (46.6 tok/s), well under the 100 ms threshold
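Test 3 boils down to running the same prompt with real and zeroed vision embeddings and comparing logits. A hedged sketch, where forward_logits is a hypothetical callable wrapping the model's forward pass:

```python
import torch

def vision_ab_check(forward_logits, input_ids, vision_embeds, threshold=1.0):
    """Assert that the vision path actually influences the decoder."""
    real = forward_logits(input_ids, vision_embeds)
    zero = forward_logits(input_ids, torch.zeros_like(vision_embeds))
    max_logit_diff = (real - zero).abs().max().item()
    # This PR measured max_logit_diff=15.2 with real images.
    assert max_logit_diff > threshold, f"vision path inert: {max_logit_diff=}"
    return max_logit_diff
```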

Notes

  • MoonViT runs on CPU (all 64 Neuron cores are consumed by the text decoder at TP=64). Vision embeddings are pre-computed before the text decoder loads; see the sketch after these notes.
  • The K2 text decoder code (modeling_kimi_k2.py) is identical to PR #131 (Add Kimi-K2-Instruct-0905 contrib model, 1T MoE on trn2.48xlarge). If K2 is merged first, this file can reference it instead of duplicating.
  • Uses --no_on_device_sampling (CPU greedy) due to a separate ODS issue with the K2 architecture.
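A sketch of the two-phase flow from the first note (the helper below is illustrative; the contrib's actual entry points may differ):

```python
import torch

def precompute_vision_embeddings(vision_encoder, pixel_values, out_path):
    """Phase 1: run MoonViT on CPU and persist its outputs, so the text
    decoder can later claim all 64 Neuron cores at TP=64."""
    with torch.no_grad():
        embeds = vision_encoder(pixel_values)   # [num_image_tokens, hidden]
    torch.save(embeds.cpu(), out_path)
    return out_path

# Phase 2 then loads the TP=64 text decoder and fuses torch.load(out_path)
# into the input sequence via the index_put pattern shown above.
```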

Kimi-K2.5 (moonshotai/Kimi-K2.5) on trn2.48xlarge via NxDI. Extends
the K2 text decoder (PR aws-neuron#131) with MoonViT-400M vision encoder fusion.

Configuration: TP=64, EP=1, LNC=2, seq_len=512, FP8 per-channel quantized.
Performance: 46.6 tok/s TKG (21.4 ms TPOT), 2.1s CTE with vision.
Tested on Neuron SDK 2.29 (DLAMI 20260410).

Includes:
- K2 text decoder (MLA attention, 384-expert MoE, shared experts)
- K2.5 checkpoint loader (INT4 dequant -> BF16 -> FP8 per-channel)
- Vision fusion via scatter_by_index_put (Llama4 pattern)
- K25ImageToTextModelWrapper with non-trivial tracing inputs
- MoonViT encoder with real-number 2D RoPE
- 5 integration tests (smoke, multimodal gen, vision A/B, coherence, TPOT)
### K2.5 Weight Format

K2.5 uses a different weight format than K2:
- **Expert weights:** INT4 compressed-tensors (pack-quantized, group_size=32, symmetric)

Maybe there's a one-step conversion from int4 to fp8 that we can invent?
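For illustration, one possible shape of that one-step path. This is a sketch under assumptions: symmetric pack-quantized INT4 with group_size=32 (per the snippet above), a guessed low/high nibble packing order and scale layout for the compressed-tensors format, and PyTorch >= 2.1 for float8_e4m3fn:

```python
import torch

def int4_to_fp8_per_channel(packed: torch.Tensor, scales: torch.Tensor,
                            group_size: int = 32):
    """Fused dequant + requant: symmetric group-wise INT4 -> per-channel FP8,
    skipping a separate BF16 checkpoint pass.

    packed: int8 [out, in // 2], two 4-bit values per byte (order assumed)
    scales: [out, in // group_size] per-group scales (symmetric, no zero point)
    """
    # Unpack the two signed nibbles; arithmetic shifts sign-extend.
    lo = (packed << 4) >> 4
    hi = packed >> 4
    q = torch.stack((lo, hi), dim=-1).flatten(-2)            # [out, in]

    # Group-wise dequant.
    w = (q.float().unflatten(-1, (-1, group_size))
         * scales.float().unsqueeze(-1)).flatten(-2)

    # Per-output-channel FP8 (e4m3) requant.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max           # 448.0
    ch_scale = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / fp8_max
    w_fp8 = (w / ch_scale).to(torch.float8_e4m3fn)
    return w_fp8, ch_scale
```

This avoids writing a BF16 checkpoint to disk, though it still materializes an FP32 intermediate in memory per tensor.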


K2.6 is a post-training update of K2.5 with identical architecture
(only eos_token_id differs: 163585 -> 163586). No source code changes
needed -- NxDI reads eos_token_id from config at load time.

Changes:
- README: Add K2.6 model info, benchmarks (57.9 tok/s TKG, 17.3ms
  TPOT, 1010ms TTFT), compatibility matrix, behavioral notes
- test_model.py: Make paths configurable via env vars (KIMI_MODEL_PATH,
  KIMI_TEXT_MODEL_DIR, KIMI_COMPILED_PATH, KIMI_VISION_EMB_PATH) so
  tests work with either K2.5 or K2.6 checkpoints

Validated on trn2.48xlarge, SDK 2.29, TP=64, EP=1, LNC=2.
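The env-var pattern that commit describes might look like this at the top of test_model.py (the defaults shown are illustrative):

```python
import os

# Point these env vars at the desired checkpoint so the same integration
# tests run against either K2.5 or K2.6.
MODEL_PATH = os.environ.get("KIMI_MODEL_PATH", "/models/Kimi-K2.5")
TEXT_MODEL_DIR = os.environ.get("KIMI_TEXT_MODEL_DIR", f"{MODEL_PATH}/text")
COMPILED_PATH = os.environ.get("KIMI_COMPILED_PATH", "/compiled/kimi-k25")
VISION_EMB_PATH = os.environ.get("KIMI_VISION_EMB_PATH", "vision_embeddings.pt")
```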
