Conversation

@ikhyunAn ikhyunAn commented Nov 4, 2025

Add deterministic base-architecture preference for cross-family merges

Description

This PR adds deterministic architecture selection for cross-family model merging, ensuring the output model consistently uses the base_model's architecture rather than non-deterministically selecting from the referenced models. It implements the feature requested in #640.

Problems Solved

1. Non-deterministic Architecture Selection

Previously, when merging models from different families (e.g., Llama + Qwen), MergeConfiguration.referenced_models() used an unordered set, making architecture selection unpredictable. This could cause runtime errors when the selected architecture required tensors not present in the base model.

Error Example:

RuntimeError: Tensor model.layers.0.self_attn.k_norm.weight required but not present 
in model TsinghuaC3I/Llama-3-8B-UltraMedical (HF)
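
As a minimal illustration of why this was unpredictable (illustration only, not mergekit's actual code): Python sets have no guaranteed iteration order, and string hashing is randomized between interpreter runs, so "take the first referenced model" is not reproducible.

# Illustration only (not mergekit code): picking "the first" element of an
# unordered set is not stable across runs or across config orderings.
referenced = {"TsinghuaC3I/Llama-3-8B-UltraMedical", "Intelligent-Internet/II-Medical-8B"}
chosen = next(iter(referenced))  # may be either model, depending on hash seeding
print(chosen)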

2. Inflexible SLERP Parameter Requirements

SLERP required a t value for every tensor, making selective component merging difficult. This prevented graceful fallback when merging models with incompatible tensor shapes.

Solution

This PR implements two focused changes:

  1. Prefer base_model architecture when multiple known architectures are present
  2. Make SLERP t parameter optional with graceful fallback to base model weights

Changes

1. mergekit/architecture/__init__.py (+9 lines)

Explicitly prefer the base_model's architecture when mixing different families:

# Prefer using the base_model's architecture when mixing different families
# to ensure the output layout matches the base.
if config.base_model is not None:
    try:
        idx = models.index(config.base_model)
        return model_arch_info[idx]
    except ValueError:
        # base_model not in referenced models; fall back to first
        pass
return model_arch_info[0]

2. mergekit/merge_methods/slerp.py (+12, -5 lines)

Make t parameter optional with safe fallback:

class SlerpTask(Task[torch.Tensor]):
    t: Optional[float]  # Changed from: t: float
    
    def execute(self, **kwargs) -> torch.Tensor:
        # ... existing validation ...
        
        # If no interpolation parameter was provided for this tensor, do not attempt to merge;
        # simply return the base model's weight unchanged. This avoids shape/broadcast errors
        # when the secondary model has incompatible tensor shapes.
        if self.t is None:
            return tensors[self.base_model]
        
        # ... rest of SLERP logic unchanged ...

class SlerpMerge(MergeMethod):
    def parameters(self) -> List[ConfigParameterDef]:
        return [ConfigParameterDef(name="t", required=False, default_value=None)]
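
For intuition, the fallback reduces to the following (a simplified sketch with hypothetical names, not the actual SlerpTask wiring):

import torch
from typing import Callable, Optional

def slerp_or_base(
    t: Optional[float],
    base: torch.Tensor,
    other: torch.Tensor,
    slerp_fn: Callable[[float, torch.Tensor, torch.Tensor], torch.Tensor],
) -> torch.Tensor:
    # No interpolation factor for this tensor: keep the base weight as-is,
    # which also sidesteps shape mismatches with the secondary model.
    if t is None:
        return base
    return slerp_fn(t, base, other)

Returning the base weight (rather than raising an error) is what makes per-tensor t filters sufficient to express "merge only these components".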

3. Minor Improvements

  • mergekit/io/tensor_writer.py: Added path normalization for robustness
  • mergekit/tokenizer/embed.py: Fixed ZeroEmbedding initialization

Impact

Backward Compatibility

Fully backward compatible:

  • Same-family merges: No behavioral changes
  • Existing configs with explicit t values: Continue working exactly as before
  • Cross-family merges: Now deterministic and base-aligned (previously undefined behavior)

Use Cases Enabled

This change enables new merge patterns:

Selective Cross-Family Component Merging:

merge_method: slerp
base_model: meta-llama/Llama-3-8B
slices:
  - sources:
      - model: meta-llama/Llama-3-8B
        layer_range: [5, 6]
      - model: Qwen/Qwen2.5-7B
        layer_range: [10, 11]
    parameters:
      t:
        - filter: mlp.gate_proj.weight
          value: 0.5
        - filter: mlp.up_proj.weight
          value: 0.5
        - filter: mlp.down_proj.weight
          value: 0.5
        # Other tensors (attention, norms) default to base via t=None
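
A config like this runs through the usual mergekit entry points; below is a sketch following mergekit's documented Python API (import paths, option names, and the config filename are assumptions and may differ slightly between versions):

import yaml
import torch

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# "cross_family_slerp.yml" is a hypothetical filename for the config above.
with open("cross_family_slerp.yml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    "./merged-model",  # output directory
    options=MergeOptions(
        cuda=torch.cuda.is_available(),
        copy_tokenizer=True,
    ),
)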

Benefits:

  • Transfer specific capabilities between model families
  • Map different layer indices (base layer 5 ← merge layer 10)
  • Preserve base architecture while selectively incorporating components
  • Safe handling of incompatible tensor shapes

Testing

Unit Tests

  • Existing tests pass without modification
  • Architecture selection logic validated
  • Optional t parameter handling verified

Integration Testing

Successful cross-family merge:

  • Base: Llama-3-8B-UltraMedical (32 layers, Llama architecture)
  • Merge: II-Medical-8B (36 layers, Qwen architecture with k_norm)
  • Layer mappings: 0→0, 16→18, 31→35
  • Components: Attention only (q_proj, k_proj, v_proj, o_proj)

Results:

  • Output model: 18GB, 4 safetensor shards
  • Architecture: LlamaForCausalLM (base), not Qwen
  • No k_norm tensors (correctly avoided)
  • Model loads successfully
  • Deterministic: Same config produces identical results

Test Config

Effective merge config (click to expand)
merge_method: slerp
base_model: TsinghuaC3I/Llama-3-8B-UltraMedical
dtype: bfloat16
slices:
  # Mapped layer: base 0 ← merge 0
  - sources:
      - model: TsinghuaC3I/Llama-3-8B-UltraMedical
        layer_range: [0, 1]
      - model: Intelligent-Internet/II-Medical-8B
        layer_range: [0, 1]
    parameters:
      t:
        - filter: self_attn.q_proj.weight
          value: 0.3
        - filter: self_attn.k_proj.weight
          value: 0.3
        # ... v_proj / o_proj similar
  
  # Unmapped layer: pure base
  - sources:
      - model: TsinghuaC3I/Llama-3-8B-UltraMedical
        layer_range: [1, 2]
    parameters:
      t:
        - value: 0.0
  
  # ... layers 2-15 similar (base only) ...
  
  # Cross-layer mapped: base 16 ← merge 18
  - sources:
      - model: TsinghuaC3I/Llama-3-8B-UltraMedical
        layer_range: [16, 17]
      - model: Intelligent-Internet/II-Medical-8B
        layer_range: [18, 19]
    parameters:
      t:
        - filter: self_attn.q_proj.weight
          value: 0.3
        # ... other attention components

Performance

  • Code size: +21 lines, -5 lines (net +16 lines)
  • Runtime impact: Negligible (architecture selection happens once at merge start)
  • Memory impact: None
  • Merge speed: No change

Checklist

  • Code follows project style guidelines
  • Changes are minimal and focused
  • Backward compatibility maintained
  • Existing tests pass
  • Integration test successful (real cross-family merge)
  • Documentation updated (inline comments)
  • No breaking changes

Related Issues

Fixes/Addresses: #640

Questions for Reviewers

  1. Architecture preference: Is always preferring base_model acceptable, or should we add a config option like architecture_source: base|first|merge?

  2. Optional t behavior: The current implementation returns base weights when t=None. Is this the desired fallback, or would you prefer a different default?

  3. Alternative naming: Should the fallback be more explicit? e.g., t: "base" instead of t: None?

  4. Documentation: Should we add examples to the main README demonstrating cross-family merges? I have a private repository that has a merge option for cross-model family, layer-wise merging. I will make it available for testing if requested.

Migration Guide

No migration is needed; this is a non-breaking change.

Existing configs will work identically. New cross-family merge patterns are opt-in via:

  • Using base_model in config (recommended for cross-family)
  • Omitting t values for tensors that should remain from base

Future Enhancements

Potential follow-ups (out of scope for this PR):

  • Explicit architecture_source config option
  • Per-slice architecture override
  • Automatic shape compatibility validation
  • Support for other merge methods (DARE, TIES, etc.)

Testing Environment:

  • Python: 3.10+
  • PyTorch: 2.0+
  • Hardware: Multi-GPU setup (NVIDIA H200)
  • Models tested: Llama-3 8B, Qwen 7B variants

Merge Successful: Real-world 18GB model produced and validated


Note

Deterministically selects the base model’s architecture for mixed-family merges, makes SLERP’s t optional with base-weight fallback, and adds path normalization plus zero-embedding fixes.

  • Architecture:
    • Prefer config.base_model’s architecture when multiple known architectures are present in mergekit/architecture/__init__.py, otherwise fall back to the first.
  • Merge Methods (SLERP):
    • Make t optional in mergekit/merge_methods/slerp.py; when t=None, return base model weights without merging to avoid shape issues.
    • Update parameter schema to required=False with default_value=None.
  • Tokenizer/Embeddings:
    • Use ZeroEmbedding(kind="zero") for missing tokens and create correct-shaped zero vectors in mergekit/tokenizer/embed.py.
  • IO:
    • Normalize output path with os.path.abspath(os.path.expanduser(...)) in mergekit/io/tensor_writer.py.

Written by Cursor Bugbot for commit e85a454.

github-actions bot commented Nov 4, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

ikhyunAn commented Nov 4, 2025

I have read the CLA Document and I hereby sign the CLA

ikhyunAn commented Nov 4, 2025

I'm not familiar with pre-commit, but I installed it and ran it with --all-files. The "Pending Check" status should now be updated.
