
Add Kimi-K2.5 multimodal contrib (1T MoE + MoonViT vision encoder) #145

Open

jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/kimi-k25-multimodal

Conversation

@jimburtoft (Contributor)

Summary

Community contrib for Kimi-K2.5 (moonshotai/Kimi-K2.5) — a native multimodal agentic model with a 1T MoE text decoder and 400M MoonViT vision encoder — running on trn2.48xlarge via NxD Inference.

This extends the Kimi-K2 text-only contrib (PR #131) with vision encoder fusion for multimodal image-to-text inference.

Configuration

  • Instance: trn2.48xlarge (TP=64, EP=1, LNC=2)
  • Quantization: FP8 per-channel (experts), BF16 (attention, shared experts)
  • Sequence length: 512 (configurable)
  • SDK: Neuron SDK 2.29 (DLAMI 20260410)

Performance

| Metric | Value |
| --- | --- |
| TKG throughput | 46.6 tok/s (21.4 ms TPOT) |
| CTE latency (with vision) | 2.1 s |
| MoonViT encoding | 35.5 ms |
| Model load time | ~50 min (from-scratch sharding) |

Architecture

  • Text decoder: Kimi-K2 MoE (61 layers, 384 routed experts + shared experts, MLA attention) — reused unchanged from PR #131 (Add Kimi-K2-Instruct-0905 contrib model, 1T MoE on trn2.48xlarge)
  • Vision encoder: MoonViT-400M (27-layer ViT, real-number 2D RoPE, eager attention, PatchMergerMLP)
  • Fusion: scatter_by_index_put (same Llama4/Pixtral pattern used in NxDI); see the sketch after this list
  • Checkpoint: K2.5 INT4 compressed-tensors → dequant to BF16 → FP8 per-channel re-quantization
  • Wrapper: K25ImageToTextModelWrapper with non-trivial (ones-like) vision tracing inputs
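For reference, here is the fusion pattern as a minimal PyTorch sketch. The contrib's actual helper is scatter_by_index_put; the function name and signature below are illustrative, not the contrib's API:

```python
import torch

def scatter_vision_embeddings(text_embeds: torch.Tensor,
                              vision_embeds: torch.Tensor,
                              input_ids: torch.Tensor,
                              image_token_id: int) -> torch.Tensor:
    """Overwrite token embeddings at image-placeholder positions with the
    MoonViT outputs (the Llama4/Pixtral-style index_put fusion).

    text_embeds:   [batch, seq_len, hidden] token embeddings
    vision_embeds: [num_image_tokens, hidden], precomputed on CPU
    input_ids:     [batch, seq_len]; image_token_id marks image slots
    """
    positions = (input_ids == image_token_id).nonzero(as_tuple=True)
    return text_embeds.index_put(positions, vision_embeds.to(text_embeds.dtype))
```

The number of placeholder tokens in input_ids must equal num_image_tokens, which is what allows the embeddings to be precomputed offline and scattered in a single call.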

Files

| File | Lines | Description |
| --- | --- | --- |
| src/modeling_kimi_k2.py | 1548 | K2 text decoder (from PR #131) |
| src/modeling_kimi_k25.py | 1041 | K2.5 multimodal: checkpoint loader, vision fusion, config builder |
| src/moonvit.py | 397 | MoonViT vision encoder |
| test/integration/test_model.py | 500 | 5 integration tests |
| README.md | 335 | Documentation with usage, benchmarks, compatibility |

Tests

All 5 integration tests pass on trn2.48xlarge (SDK 2.29):

  1. Smoke test — model loads successfully
  2. Multimodal generation — generates coherent image descriptions
  3. Vision A/B — real vs zero vision embeddings produce different outputs (max_logit_diff=15.2); sketched below
  4. Coherence — output is non-repetitive, descriptive text
  5. TPOT — 21.4 ms/token (46.6 tok/s), well under the 100 ms threshold
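Test 3 boils down to running the same prompt with real and zeroed vision embeddings and comparing logits. A hedged sketch, where forward_logits is a hypothetical callable wrapping the model's forward pass:

```python
import torch

def vision_ab_check(forward_logits, input_ids, vision_embeds, threshold=1.0):
    """Assert that the vision path actually influences the decoder."""
    real = forward_logits(input_ids, vision_embeds)
    zero = forward_logits(input_ids, torch.zeros_like(vision_embeds))
    max_logit_diff = (real - zero).abs().max().item()
    # This PR measured max_logit_diff=15.2 with real images.
    assert max_logit_diff > threshold, f"vision path inert: {max_logit_diff=}"
    return max_logit_diff
```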

Notes

  • MoonViT runs on CPU (all 64 Neuron cores are consumed by the text decoder at TP=64). Vision embeddings are pre-computed before the text decoder loads; see the sketch after these notes.
  • The K2 text decoder code (modeling_kimi_k2.py) is identical to PR #131 (Add Kimi-K2-Instruct-0905 contrib model, 1T MoE on trn2.48xlarge). If K2 is merged first, this file can reference it instead of duplicating.
  • Uses --no_on_device_sampling (CPU greedy) due to a separate ODS issue with the K2 architecture.
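A sketch of the two-phase flow from the first note (the helper below is illustrative; the contrib's actual entry points may differ):

```python
import torch

def precompute_vision_embeddings(vision_encoder, pixel_values, out_path):
    """Phase 1: run MoonViT on CPU and persist its outputs, so the text
    decoder can later claim all 64 Neuron cores at TP=64."""
    with torch.no_grad():
        embeds = vision_encoder(pixel_values)   # [num_image_tokens, hidden]
    torch.save(embeds.cpu(), out_path)
    return out_path

# Phase 2 then loads the TP=64 text decoder and fuses torch.load(out_path)
# into the input sequence via the index_put pattern shown above.
```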

Kimi-K2.5 (moonshotai/Kimi-K2.5) on trn2.48xlarge via NxDI. Extends
the K2 text decoder (PR aws-neuron#131) with MoonViT-400M vision encoder fusion.

Configuration: TP=64, EP=1, LNC=2, seq_len=512, FP8 per-channel quantized.
Performance: 46.6 tok/s TKG (21.4 ms TPOT), 2.1s CTE with vision.
Tested on Neuron SDK 2.29 (DLAMI 20260410).

Includes:
- K2 text decoder (MLA attention, 384-expert MoE, shared experts)
- K2.5 checkpoint loader (INT4 dequant -> BF16 -> FP8 per-channel)
- Vision fusion via scatter_by_index_put (Llama4 pattern)
- K25ImageToTextModelWrapper with non-trivial tracing inputs
- MoonViT encoder with real-number 2D RoPE
- 5 integration tests (smoke, multimodal gen, vision A/B, coherence, TPOT)
### K2.5 Weight Format

K2.5 uses a different weight format than K2:
- **Expert weights:** INT4 compressed-tensors (pack-quantized, group_size=32, symmetric)

Maybe there's a one-step conversion from int4 to fp8 that we can invent?
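For illustration, one possible shape of that one-step path. This is a sketch under assumptions: symmetric pack-quantized INT4 with group_size=32 (per the snippet above), a guessed low/high nibble packing order and scale layout for the compressed-tensors format, and PyTorch >= 2.1 for float8_e4m3fn:

```python
import torch

def int4_to_fp8_per_channel(packed: torch.Tensor, scales: torch.Tensor,
                            group_size: int = 32):
    """Fused dequant + requant: symmetric group-wise INT4 -> per-channel FP8,
    skipping a separate BF16 checkpoint pass.

    packed: int8 [out, in // 2], two 4-bit values per byte (order assumed)
    scales: [out, in // group_size] per-group scales (symmetric, no zero point)
    """
    # Unpack the two signed nibbles; arithmetic shifts sign-extend.
    lo = (packed << 4) >> 4
    hi = packed >> 4
    q = torch.stack((lo, hi), dim=-1).flatten(-2)            # [out, in]

    # Group-wise dequant.
    w = (q.float().unflatten(-1, (-1, group_size))
         * scales.float().unsqueeze(-1)).flatten(-2)

    # Per-output-channel FP8 (e4m3) requant.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max           # 448.0
    ch_scale = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / fp8_max
    w_fp8 = (w / ch_scale).to(torch.float8_e4m3fn)
    return w_fp8, ch_scale
```

This avoids writing a BF16 checkpoint to disk, though it still materializes an FP32 intermediate in memory per tensor.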


K2.6 is a post-training update of K2.5 with identical architecture
(only eos_token_id differs: 163585 -> 163586). No source code changes
needed -- NxDI reads eos_token_id from config at load time.

Changes:
- README: Add K2.6 model info, benchmarks (57.9 tok/s TKG, 17.3ms
  TPOT, 1010ms TTFT), compatibility matrix, behavioral notes
- test_model.py: Make paths configurable via env vars (KIMI_MODEL_PATH,
  KIMI_TEXT_MODEL_DIR, KIMI_COMPILED_PATH, KIMI_VISION_EMB_PATH) so
  tests work with either K2.5 or K2.6 checkpoints

Validated on trn2.48xlarge, SDK 2.29, TP=64, EP=1, LNC=2.
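The env-var pattern that commit describes might look like this at the top of test_model.py (the defaults shown are illustrative):

```python
import os

# Point these env vars at the desired checkpoint so the same integration
# tests run against either K2.5 or K2.6.
MODEL_PATH = os.environ.get("KIMI_MODEL_PATH", "/models/Kimi-K2.5")
TEXT_MODEL_DIR = os.environ.get("KIMI_TEXT_MODEL_DIR", f"{MODEL_PATH}/text")
COMPILED_PATH = os.environ.get("KIMI_COMPILED_PATH", "/compiled/kimi-k25")
VISION_EMB_PATH = os.environ.get("KIMI_VISION_EMB_PATH", "vision_embeddings.pt")
```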
