Add Kimi-K2.5 multimodal contrib (1T MoE + MoonViT vision encoder) #145
Open
jimburtoft wants to merge 2 commits into aws-neuron:main from
Kimi-K2.5 (moonshotai/Kimi-K2.5) on trn2.48xlarge via NxDI. Extends the K2 text decoder (PR aws-neuron#131) with MoonViT-400M vision encoder fusion.

Configuration: TP=64, EP=1, LNC=2, seq_len=512, FP8 per-channel quantized.

Performance: 46.6 tok/s TKG (21.4 ms TPOT), 2.1 s CTE with vision.

Tested on Neuron SDK 2.29 (DLAMI 20260410).

Includes:
- K2 text decoder (MLA attention, 384-expert MoE, shared experts)
- K2.5 checkpoint loader (INT4 dequant -> BF16 -> FP8 per-channel)
- Vision fusion via scatter_by_index_put (Llama4 pattern)
- K25ImageToTextModelWrapper with non-trivial tracing inputs
- MoonViT encoder with real-number 2D RoPE
- 5 integration tests (smoke, multimodal gen, vision A/B, coherence, TPOT)
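The vision fusion item above can be illustrated with a small index-put sketch. This is not the NxDI `scatter_by_index_put` implementation; it is a NumPy approximation of the general idea, assuming the image placeholder positions in the sequence are already known. The function name `fuse_vision_embeddings` and all shapes are hypothetical.

```python
import numpy as np


def fuse_vision_embeddings(text_emb, vision_emb, image_positions):
    """Overwrite text embeddings at image placeholder positions.

    text_emb:        (seq_len, hidden) token embeddings
    vision_emb:      (n_img_tokens, hidden) vision encoder outputs
    image_positions: (n_img_tokens,) integer sequence indices
    All names and shapes are illustrative assumptions.
    """
    if vision_emb.shape[0] != image_positions.shape[0]:
        raise ValueError("need exactly one position per vision token")
    fused = text_emb.copy()              # leave the caller's array untouched
    fused[image_positions] = vision_emb  # index-put: scatter rows by position
    return fused


# Toy example: 8 tokens, 3 of which are image placeholders.
seq_len, hidden = 8, 4
text_emb = np.zeros((seq_len, hidden), dtype=np.float32)
vision_emb = np.ones((3, hidden), dtype=np.float32)
positions = np.array([2, 3, 4])
fused = fuse_vision_embeddings(text_emb, vision_emb, positions)
```

In this toy run, rows 2 through 4 of `fused` hold the vision embeddings and every other row keeps its text embedding.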
jeffhataws
reviewed
Apr 27, 2026
### K2.5 Weight Format

K2.5 uses a different weight format than K2:
- **Expert weights:** INT4 compressed-tensors (pack-quantized, group_size=32, symmetric)
Maybe there's a one-step conversion from int4 to fp8 that we can invent?
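For context, the two-step path this comment would collapse can be sketched roughly as below. This is a NumPy sketch under stated assumptions, not the loader in this PR: it assumes the INT4 nibbles are already unpacked into signed values in [-8, 7], uses float32 in place of BF16, treats each weight row as one output channel, and only computes the per-channel scale rather than performing real FP8 rounding. The default `fp8_max=448.0` is the largest finite value of the OCP E4M3FN format; the target hardware may use a different range. Both function names are hypothetical.

```python
import numpy as np


def dequant_int4_groups(q, group_scales, group_size=32):
    """Symmetric group-wise dequantization.

    q:            (out, in) int8 array holding int4 values in [-8, 7]
    group_scales: (out, in // group_size) one scale per group of inputs
    """
    _, in_features = q.shape
    if in_features % group_size != 0:
        raise ValueError("in_features must be a multiple of group_size")
    # Repeat each group's scale across the columns that belong to it.
    scales = np.repeat(group_scales.astype(np.float32), group_size, axis=1)
    return q.astype(np.float32) * scales


def per_channel_fp8_scale(w, fp8_max=448.0):
    """Rescale each output channel (row) into [-fp8_max, fp8_max].

    Returns the rescaled weights (still float32 here) and the
    per-channel scale needed to undo the rescaling.
    """
    amax = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / fp8_max, 1.0).astype(np.float32)
    w_scaled = np.clip(w / scale, -fp8_max, fp8_max)
    return w_scaled, scale


# Toy example: 4 output channels, 64 inputs, 2 groups per row.
rng = np.random.default_rng(0)
q = rng.integers(-8, 8, size=(4, 64)).astype(np.int8)
group_scales = rng.uniform(0.01, 0.1, size=(4, 2)).astype(np.float32)
w = dequant_int4_groups(q, group_scales)
w_scaled, channel_scale = per_channel_fp8_scale(w)
```

Since both steps are multiplications by scales, they compose algebraically into `q * (group_scale / channel_scale)`. Whether skipping the intermediate BF16 rounding is numerically acceptable for this model is not something the sketch establishes.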
…ure)

K2.6 is a post-training update of K2.5 with identical architecture (only eos_token_id differs: 163585 -> 163586). No source code changes needed -- NxDI reads eos_token_id from config at load time.

Changes:
- README: Add K2.6 model info, benchmarks (57.9 tok/s TKG, 17.3 ms TPOT, 1010 ms TTFT), compatibility matrix, behavioral notes
- test_model.py: Make paths configurable via env vars (KIMI_MODEL_PATH, KIMI_TEXT_MODEL_DIR, KIMI_COMPILED_PATH, KIMI_VISION_EMB_PATH) so tests work with either K2.5 or K2.6 checkpoints

Validated on trn2.48xlarge, SDK 2.29, TP=64, EP=1, LNC=2.
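The env-var configuration described in this commit could look something like the sketch below. The four variable names come from the commit message; the `KimiTestPaths` dataclass, the `load_paths` helper, and the default paths are hypothetical placeholders and are not taken from `test_model.py`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class KimiTestPaths:
    model_path: str
    text_model_dir: str
    compiled_path: str
    vision_emb_path: str


def load_paths(env=None):
    """Read test paths from the environment, falling back to placeholders.

    The defaults below are made-up examples, not the PR's real defaults.
    """
    env = os.environ if env is None else env
    return KimiTestPaths(
        model_path=env.get("KIMI_MODEL_PATH", "/path/to/Kimi-K2.5"),
        text_model_dir=env.get("KIMI_TEXT_MODEL_DIR", "/path/to/text_model"),
        compiled_path=env.get("KIMI_COMPILED_PATH", "/path/to/compiled"),
        vision_emb_path=env.get("KIMI_VISION_EMB_PATH", "/path/to/vision_emb.pt"),
    )


# Pointing the same tests at a K2.6 checkpoint only needs an override.
paths = load_paths({"KIMI_MODEL_PATH": "/models/Kimi-K2.6"})
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without mutating the process environment.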
Summary
Community contrib for Kimi-K2.5 (moonshotai/Kimi-K2.5), a native multimodal agentic model with a 1T MoE text decoder and 400M MoonViT vision encoder, running on trn2.48xlarge via NxD Inference.

This extends the Kimi-K2 text-only contrib (PR #131) with vision encoder fusion for multimodal image-to-text inference.
Configuration
Performance
Architecture
- `scatter_by_index_put` (same Llama4/Pixtral pattern used in NxDI)
- `K25ImageToTextModelWrapper` with non-trivial (ones-like) vision tracing inputs

Files
- `src/modeling_kimi_k2.py`
- `src/modeling_kimi_k25.py`
- `src/moonvit.py`
- `test/integration/test_model.py`
- `README.md`

Tests
All 5 integration tests pass on trn2.48xlarge (SDK 2.29):
Notes
- `modeling_kimi_k2.py`) is identical to PR #131 (Add Kimi-K2-Instruct-0905 contrib model (1T MoE on trn2.48xlarge)). If K2 is merged first, this file can reference it instead of duplicating.
- `--no_on_device_sampling` (CPU greedy) due to a separate ODS issue with the K2 architecture.