
@theoschiff
Contributor

This PR adds a new fusion method, fusion_method="cross_attn", for the MoE image modalities (MOEImageModality and MOEImageModalityPEP) based on generalist-queried cross-attention:

  • Introduces a reusable CrossAttention module.
  • Uses the generalist CLIP (defined as the last expert in the configs) as the query.
  • Uses specialist CLIPs as key–value context, weighted by the gating network.
  • Keeps sequence length constant (same number of patches as a single CLIP).

This is exposed via the config options fusion_method="cross_attn" and cross_attn_heads in both MoE configs.
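For concreteness, a hypothetical configuration snippet (only fusion_method="cross_attn" and cross_attn_heads come from this PR; the surrounding schema and values are assumptions):

```python
# Hypothetical config — field placement and values are illustrative, not the repo's actual schema.
moe_image_modality_config = dict(
    fusion_method="cross_attn",  # new fusion path added by this PR
    cross_attn_heads=8,          # number of cross-attention heads (illustrative value)
)
```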

What changed

  1. New CrossAttention module

    • Standard multi-head cross-attention with Q/K/V projections, dropout, and an output projection.
    • Supports masking and can be reused by both MoE variants.
    • Shape-safe helper _shape to handle [B, T, C] → [B, H, T, D] (a minimal sketch follows this list).
  2. MOEImageModality

    • Added fusion_method="cross_attn" path:

      • Stack expert outputs → [B, E, P, C].
      • Treat last expert as generalist (g_idx = -1).
      • Use generalist patch tokens as queries: q = stacked[:, g_idx, :, :] # [B, P, C].
      • Use all non-generalist experts as specialist context.
      • Align gating weights via _gating_to_expert_perm, select specialists, and softmax over them.
      • Scale each specialist’s tokens by its gating weight before passing them as KV to CrossAttention.
    • Keeps output shape [B, P, C], then projects with the existing MLPProjector (see the fusion sketch after this list).

  3. MOEImageModalityPEP

    • Same cross_attn fusion logic, but:

      • Projects per expert first (PEP), so cross-attention operates in the shared hidden_size space.
      • Reuses the same generalist-as-query, specialists-as-context pattern.
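
A minimal sketch of what the CrossAttention module might look like, assuming standard scaled dot-product attention. Only the module name, the _shape helper, and the features listed in item 1 come from this PR; everything else (argument names, defaults, mask convention) is an assumption:

```python
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Multi-head cross-attention: queries attend over a separate key/value context."""

    def __init__(self, dim: int, num_heads: int = 8, dropout: float = 0.0):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(dropout)

    def _shape(self, x: torch.Tensor) -> torch.Tensor:
        # [B, T, C] -> [B, H, T, D]
        B, T, _ = x.shape
        return x.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, query, context, mask=None):
        # query:   [B, Tq, C]  (e.g. generalist patch tokens)
        # context: [B, Tk, C]  (e.g. gated specialist tokens)
        # mask:    [B, Tq, Tk], True/1 at positions that may be attended to
        q = self._shape(self.q_proj(query))    # [B, H, Tq, D]
        k = self._shape(self.k_proj(context))  # [B, H, Tk, D]
        v = self._shape(self.v_proj(context))  # [B, H, Tk, D]

        attn = (q @ k.transpose(-2, -1)) * self.scale  # [B, H, Tq, Tk]
        if mask is not None:
            attn = attn.masked_fill(~mask.unsqueeze(1).bool(), float("-inf"))
        attn = self.dropout(attn.softmax(dim=-1))

        out = (attn @ v).transpose(1, 2).reshape(query.shape)  # back to [B, Tq, C]
        return self.out_proj(out)
```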

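Likewise, a sketch of the cross_attn fusion path from items 2–3, written as a standalone function for readability. The alignment performed by _gating_to_expert_perm is assumed to have already happened, and in the PEP variant each expert output would be projected to hidden_size before this step; tensor names are illustrative:

```python
import torch


def cross_attn_fuse(expert_outputs, gating_weights, cross_attn):
    """
    expert_outputs: list of E tensors, each [B, P, C]; the last expert is the generalist.
    gating_weights: [B, E] gating scores, assumed already aligned to expert order.
    cross_attn:     a CrossAttention module with dim == C.
    Returns fused patch tokens of shape [B, P, C].
    """
    stacked = torch.stack(expert_outputs, dim=1)           # [B, E, P, C]
    g_idx = -1                                             # last expert = generalist
    query = stacked[:, g_idx, :, :]                        # [B, P, C] generalist queries

    specialists = stacked[:, :-1, :, :]                    # [B, E-1, P, C]
    spec_w = gating_weights[:, :-1].softmax(dim=-1)        # renormalize over specialists
    specialists = specialists * spec_w[:, :, None, None]   # scale each specialist's tokens

    B, E_spec, P, C = specialists.shape
    context = specialists.reshape(B, E_spec * P, C)        # specialist tokens as KV context

    return cross_attn(query, context)                      # [B, P, C], same length as one CLIP
```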
Comparison to existing fusion strategies:

  • vs. sequence_append:

    • sequence_append grows the sequence length linearly with the number of experts, which is expensive for the LLM (quadratic attention cost, more memory).
    • cross_attn keeps the same number of tokens as a single CLIP, so it’s much more scalable while still leveraging multiple experts.
  • vs. weighted_average:

    • Simple averaging is destructive: it merges all expert features per patch into a single vector, making it hard to preserve complementary information.
    • cross_attn lets the generalist CLIP decide per patch which specialists to attend to, using multi-head attention instead of a single scalar weight. This is strictly more expressive and less likely to wash out useful specialist signals.
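
As a rough illustration (numbers are hypothetical): with 4 experts and 256 patches per CLIP, sequence_append hands the LLM 1,024 visual tokens, roughly 16× the self-attention cost of the 256 tokens cross_attn produces; weighted_average also yields 256 tokens, but collapses the 4 expert vectors per patch into one before the LLM ever sees them.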

Why “generalist CLIP as query” is a good inductive bias:

  • The generalist CLIP is trained to be robust across many domains, so using it as the query anchor keeps the final representation aligned with a strong, general embedding space.
  • Specialists contribute contextual refinements via keys/values, modulated by the gating network. This naturally matches the intuition: “generalist defines what we’re looking for, specialists provide how to refine it.”

In short:

We keep the robustness and global semantics of the generalist CLIP while letting gated specialists refine each patch via cross-attention, with no sequence length blow-up and strictly more expressive fusion than a weighted average.


Notes

  • Assumes the last expert is the generalist; this is now baked into the cross-attention path (g_idx = -1).
  • Cross-attention debug prints can be removed or guarded behind a debug flag once the method is fully validated.
  • TODO: add ablation results comparing the following across benchmarks:
    • sequence_append
    • weighted_average
    • cross_attn (generalist query)

@MichelDucartier MichelDucartier merged commit 46b5772 into master Dec 10, 2025
1 check failed
@MichelDucartier MichelDucartier deleted the add-cross-attention branch December 10, 2025 15:39