Conversation

@ikhyunAn ikhyunAn commented Nov 4, 2025

Add deterministic base-architecture preference for cross-family merges

Description

This PR adds deterministic architecture selection for cross-family model merging, ensuring the output model consistently uses the base_model's architecture rather than non-deterministically selecting from the referenced models. It implements the feature requested in #640.

Problems Solved

1. Non-deterministic Architecture Selection

Previously, when merging models from different families (e.g., Llama + Qwen), MergeConfiguration.referenced_models() used an unordered set, making architecture selection unpredictable. This could cause runtime errors when the selected architecture required tensors not present in the base model.

Error Example:

RuntimeError: Tensor model.layers.0.self_attn.k_norm.weight required but not present 
in model TsinghuaC3I/Llama-3-8B-UltraMedical (HF)
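
As a minimal illustration of why this was unpredictable (illustration only, not mergekit's actual code): Python sets have no guaranteed iteration order, and string hashing is randomized between interpreter runs, so "take the first referenced model" is not reproducible.

# Illustration only (not mergekit code): picking "the first" element of an
# unordered set is not stable across runs or across config orderings.
referenced = {"TsinghuaC3I/Llama-3-8B-UltraMedical", "Intelligent-Internet/II-Medical-8B"}
chosen = next(iter(referenced))  # may be either model, depending on hash seeding
print(chosen)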

2. Inflexible SLERP Parameter Requirements

SLERP required a t value for every tensor, making selective component merging difficult. This prevented graceful fallback when merging models with incompatible tensor shapes.

Solution

This PR implements two focused changes:

  1. Prefer base_model architecture when multiple known architectures are present
  2. Make SLERP t parameter optional with graceful fallback to base model weights

Changes

1. mergekit/architecture/__init__.py (+9 lines)

Explicitly prefer the base_model's architecture when mixing different families:

# Prefer using the base_model's architecture when mixing different families
# to ensure the output layout matches the base.
if config.base_model is not None:
    try:
        idx = models.index(config.base_model)
        return model_arch_info[idx]
    except ValueError:
        # base_model not in referenced models; fall back to first
        pass
return model_arch_info[0]

2. mergekit/merge_methods/slerp.py (+12, -5 lines)

Make t parameter optional with safe fallback:

class SlerpTask(Task[torch.Tensor]):
    t: Optional[float]  # Changed from: t: float
    
    def execute(self, **kwargs) -> torch.Tensor:
        # ... existing validation ...
        
        # If no interpolation parameter was provided for this tensor, do not attempt to merge;
        # simply return the base model's weight unchanged. This avoids shape/broadcast errors
        # when the secondary model has incompatible tensor shapes.
        if self.t is None:
            return tensors[self.base_model]
        
        # ... rest of SLERP logic unchanged ...

class SlerpMerge(MergeMethod):
    def parameters(self) -> List[ConfigParameterDef]:
        return [ConfigParameterDef(name="t", required=False, default_value=None)]
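
For intuition, the fallback reduces to the following (a simplified sketch with hypothetical names, not the actual SlerpTask wiring):

import torch
from typing import Callable, Optional

def slerp_or_base(
    t: Optional[float],
    base: torch.Tensor,
    other: torch.Tensor,
    slerp_fn: Callable[[float, torch.Tensor, torch.Tensor], torch.Tensor],
) -> torch.Tensor:
    # No interpolation factor for this tensor: keep the base weight as-is,
    # which also sidesteps shape mismatches with the secondary model.
    if t is None:
        return base
    return slerp_fn(t, base, other)

Returning the base weight (rather than raising an error) is what makes per-tensor t filters sufficient to express "merge only these components".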

3. Minor Improvements

  • mergekit/io/tensor_writer.py: Added path normalization for robustness
  • mergekit/tokenizer/embed.py: Fixed ZeroEmbedding initialization

Impact

Backward Compatibility

Fully backward compatible:

  • Same-family merges: No behavioral changes
  • Existing configs with explicit t values: Continue working exactly as before
  • Cross-family merges: Now deterministic and base-aligned (previously undefined behavior)

Use Cases Enabled

This change enables new merge patterns:

Selective Cross-Family Component Merging:

merge_method: slerp
base_model: meta-llama/Llama-3-8B
slices:
  - sources:
      - model: meta-llama/Llama-3-8B
        layer_range: [5, 6]
      - model: Qwen/Qwen2.5-7B
        layer_range: [10, 11]
    parameters:
      t:
        - filter: mlp.gate_proj.weight
          value: 0.5
        - filter: mlp.up_proj.weight
          value: 0.5
        - filter: mlp.down_proj.weight
          value: 0.5
        # Other tensors (attention, norms) default to base via t=None
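
A config like this runs through the usual mergekit entry points; below is a sketch following mergekit's documented Python API (import paths, option names, and the config filename are assumptions and may differ slightly between versions):

import yaml
import torch

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# "cross_family_slerp.yml" is a hypothetical filename for the config above.
with open("cross_family_slerp.yml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    "./merged-model",  # output directory
    options=MergeOptions(
        cuda=torch.cuda.is_available(),
        copy_tokenizer=True,
    ),
)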

Benefits:

  • Transfer specific capabilities between model families
  • Map different layer indices (base layer 5 ← merge layer 10)
  • Preserve base architecture while selectively incorporating components
  • Safe handling of incompatible tensor shapes

Testing

Unit Tests

  • Existing tests pass without modification
  • Architecture selection logic validated
  • Optional t parameter handling verified

Integration Testing

Successful cross-family merge:

  • Base: Llama-3-8B-UltraMedical (32 layers, Llama architecture)
  • Merge: II-Medical-8B (36 layers, Qwen architecture with k_norm)
  • Layer mappings: 0→0, 16→18, 31→35
  • Components: Attention only (q_proj, k_proj, v_proj, o_proj)

Results:

  • Output model: 18GB, 4 safetensor shards
  • Architecture: LlamaForCausalLM (base), not Qwen
  • No k_norm tensors (correctly avoided)
  • Model loads successfully
  • Deterministic: Same config produces identical results

Test Config

Effective merge config (click to expand)
merge_method: slerp
base_model: TsinghuaC3I/Llama-3-8B-UltraMedical
dtype: bfloat16
slices:
  # Mapped layer: base 0 ← merge 0
  - sources:
      - model: TsinghuaC3I/Llama-3-8B-UltraMedical
        layer_range: [0, 1]
      - model: Intelligent-Internet/II-Medical-8B
        layer_range: [0, 1]
    parameters:
      t:
        - filter: self_attn.q_proj.weight
          value: 0.3
        - filter: self_attn.k_proj.weight
          value: 0.3
        # ... v_proj / o_proj similar
  
  # Unmapped layer: pure base
  - sources:
      - model: TsinghuaC3I/Llama-3-8B-UltraMedical
        layer_range: [1, 2]
    parameters:
      t:
        - value: 0.0
  
  # ... layers 2-15 similar (base only) ...
  
  # Cross-layer mapped: base 16 ← merge 18
  - sources:
      - model: TsinghuaC3I/Llama-3-8B-UltraMedical
        layer_range: [16, 17]
      - model: Intelligent-Internet/II-Medical-8B
        layer_range: [18, 19]
    parameters:
      t:
        - filter: self_attn.q_proj.weight
          value: 0.3
        # ... other attention components

Performance

  • Code size: +21 lines, -5 lines (net +16 lines)
  • Runtime impact: Negligible (architecture selection happens once at merge start)
  • Memory impact: None
  • Merge speed: No change

Checklist

  • Code follows project style guidelines
  • Changes are minimal and focused
  • Backward compatibility maintained
  • Existing tests pass
  • Integration test successful (real cross-family merge)
  • Documentation updated (inline comments)
  • No breaking changes

Related Issues

Fixes/Addresses: #640

Questions for Reviewers

  1. Architecture preference: Is always preferring base_model acceptable, or should we add a config option like architecture_source: base|first|merge?

  2. Optional t behavior: The current implementation returns base weights when t=None. Is this the desired fallback, or would you prefer a different default?

  3. Alternative naming: Should the fallback be more explicit? e.g., t: "base" instead of t: None?

  4. Documentation: Should we add examples to the main README demonstrating cross-family merges? I have a private repository that has a merge option for cross-model family, layer-wise merging. I will make it available for testing if requested.

Migration Guide

No migration is needed; this is a non-breaking change.

Existing configs will work identically. New cross-family merge patterns are opt-in via:

  • Using base_model in config (recommended for cross-family)
  • Omitting t values for tensors that should remain from base

Future Enhancements

Potential follow-ups (out of scope for this PR):

  • Explicit architecture_source config option
  • Per-slice architecture override
  • Automatic shape compatibility validation
  • Support for other merge methods (DARE, TIES, etc.)

Testing Environment:

  • Python: 3.10+
  • PyTorch: 2.0+
  • Hardware: Multi-GPU setup (NVIDIA H200)
  • Models tested: Llama-3 8B, Qwen 7B variants

Merge Successful: Real-world 18GB model produced and validated


Note

Deterministically selects the base model’s architecture for mixed-family merges, makes SLERP’s t optional with base-weight fallback, and adds path normalization plus zero-embedding fixes.

  • Architecture:
    • Prefer config.base_model’s architecture when multiple known architectures are present in mergekit/architecture/__init__.py, otherwise fall back to the first.
  • Merge Methods (SLERP):
    • Make t optional in mergekit/merge_methods/slerp.py; when t=None, return base model weights without merging to avoid shape issues.
    • Update parameter schema to required=False with default_value=None.
  • Tokenizer/Embeddings:
    • Use ZeroEmbedding(kind="zero") for missing tokens and create correct-shaped zero vectors in mergekit/tokenizer/embed.py.
  • IO:
    • Normalize output path with os.path.abspath(os.path.expanduser(...)) in mergekit/io/tensor_writer.py.

Written by Cursor Bugbot for commit e85a454.

github-actions bot commented Nov 4, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

ikhyunAn commented Nov 4, 2025

I have read the CLA Document and I hereby sign the CLA

ikhyunAn commented Nov 4, 2025

I'm not familiar with pre-commit, but I installed it and ran it with --all-files. The "Pending Check" status should now be updated.
