feat: Deterministic base-architecture preference for cross-family merges #641
Add deterministic base-architecture preference for cross-family merges
Description
This PR adds deterministic architecture selection for cross-family model merging, ensuring the output model consistently uses the `base_model`'s architecture rather than non-deterministically selecting one from the referenced models. It implements the feature requested in #640.

Problems Solved
1. Non-deterministic Architecture Selection
Previously, when merging models from different families (e.g., Llama + Qwen), `MergeConfiguration.referenced_models()` used an unordered set, making architecture selection unpredictable. This could cause runtime errors when the selected architecture required tensors not present in the base model.

Error Example:
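The collapsed error snippet did not survive extraction; a representative failure of this kind (message wording assumed, tensor names illustrative) can be reproduced as:

```python
# Hypothetical reproduction of the failure mode: the planner picks a
# Qwen-style architecture, which expects k_norm weights that a Llama
# base checkpoint does not contain. All names are illustrative.
def require_tensor(base_tensors: set, name: str) -> str:
    if name not in base_tensors:
        raise KeyError(
            f"{name} required by selected architecture "
            "but missing from base model"
        )
    return name

llama_base = {"model.layers.0.self_attn.k_proj.weight"}
# Raises KeyError: the Qwen-style k_norm tensor is absent from the base.
# require_tensor(llama_base, "model.layers.0.self_attn.k_norm.weight")
```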
2. Inflexible SLERP Parameter Requirements
SLERP required a `t` value for every tensor, making selective component merging difficult. This prevented graceful fallback when merging models with incompatible tensor shapes.

Solution
This PR implements two focused changes:
Changes
1. `mergekit/architecture/__init__.py` (+9 lines)

Explicitly prefer the `base_model`'s architecture when mixing different families:
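(The inline snippet collapsed in rendering; a hedged sketch of the selection logic, with a hypothetical `arch_of` lookup standing in for mergekit's architecture detection:)

```python
# Sketch only: `arch_of` and the name->architecture table are
# hypothetical stand-ins; the real code works on model references.
ARCHS = {
    "meta-llama/Llama-3-8B": "LlamaForCausalLM",
    "Qwen/Qwen2-7B": "Qwen2ForCausalLM",
}

def arch_of(model: str) -> str:
    return ARCHS[model]

def select_architecture(referenced, base_model=None):
    # Prefer the configured base model's architecture...
    if base_model is not None:
        return arch_of(base_model)
    # ...otherwise iterate in sorted order so the fallback is
    # deterministic rather than depending on set iteration order.
    return arch_of(sorted(referenced)[0])
```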
2. `mergekit/merge_methods/slerp.py` (+12, -5 lines)

Make the `t` parameter optional with a safe fallback:

3. Minor Improvements

- `mergekit/io/tensor_writer.py`: added path normalization for robustness
- `mergekit/tokenizer/embed.py`: fixed `ZeroEmbedding` initialization

Impact
Backward Compatibility
Fully backward compatible:
- Existing `t` values continue working exactly as before

Use Cases Enabled
This change enables new merge patterns:
Selective Cross-Family Component Merging:
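(The example config block collapsed in rendering; a sketch of the pattern, written as a plain dict mirroring a mergekit YAML config — field names and filter syntax are my reading of the schema, model names are placeholders:)

```python
# Hypothetical selective cross-family SLERP config. With this PR,
# tensors not matched by any `t` filter fall back to base weights
# instead of erroring out.
merge_config = {
    "merge_method": "slerp",
    "base_model": "meta-llama/Llama-3-8B",   # architecture winner
    "models": [{"model": "Qwen/Qwen2-7B"}],  # cross-family donor
    "parameters": {
        # Interpolate only MLP weights; everything else keeps base.
        "t": [{"filter": "mlp", "value": 0.5}],
    },
    "dtype": "bfloat16",
}
```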
Benefits:
Testing
Unit Tests
- `t` parameter handling verified

Integration Testing

Successful cross-family merge (Qwen donor with `k_norm` tensors into a Llama base):

Results:
- Output architecture is `LlamaForCausalLM` (base), not Qwen
- No `k_norm` tensors in the output (correctly avoided)

Test Config
Effective merge config
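The base-architecture check can be automated with a small helper (a sketch; `model_dir` is whatever path the merge wrote to, and it relies only on the standard `config.json` emitted with Hugging Face checkpoints):

```python
import json
import os

def merged_uses_base_arch(model_dir: str, base_arch: str) -> bool:
    # Read the merged checkpoint's config.json and confirm its
    # `architectures` entry matches the base model's class name.
    with open(os.path.join(model_dir, "config.json")) as f:
        cfg = json.load(f)
    return cfg.get("architectures") == [base_arch]
```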
Performance
Checklist
Related Issues
Fixes/Addresses: #640
Questions for Reviewers
1. Architecture preference: Is always preferring `base_model` acceptable, or should we add a config option like `architecture_source: base|first|merge`?
2. Optional `t` behavior: The current implementation returns base weights when `t=None`. Is this the desired fallback, or would you prefer a different default?
3. Alternative naming: Should the fallback be more explicit, e.g., `t: "base"` instead of `t: None`?
4. Documentation: Should we add examples to the main README demonstrating cross-family merges? I have a private repository with a merge option for cross-family, layer-wise merging; I can make it available for testing if requested.
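For concreteness, the `t=None` fallback under discussion behaves roughly like this NumPy sketch (the real implementation operates on torch tensors inside mergekit; this is an illustration, not the shipped code):

```python
import numpy as np

def slerp_or_base(base: np.ndarray, other: np.ndarray, t=None) -> np.ndarray:
    # t=None means "no interpolation requested": return the base
    # weights untouched, which also sidesteps shape-mismatched pairs.
    if t is None:
        return base
    a = base / np.linalg.norm(base)
    b = other / np.linalg.norm(other)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * base + t * other  # nearly parallel: plain lerp
    return (np.sin((1 - t) * omega) * base
            + np.sin(t * omega) * other) / np.sin(omega)
```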
Migration Guide
No migration needed - this is a non-breaking change.
Existing configs will work identically. New cross-family merge patterns are opt-in via:
- Setting `base_model` in config (recommended for cross-family merges)
- Omitting `t` values for tensors that should remain from the base

Future Enhancements
Potential follow-ups (out of scope for this PR):
- `architecture_source` config option

Testing Environment:
Merge Successful: Real-world 18GB model produced and validated
Note
Deterministically selects the base model’s architecture for mixed-family merges, makes SLERP’s t optional with base-weight fallback, and adds path normalization plus zero-embedding fixes.
- Prefer `config.base_model`'s architecture when multiple known architectures are present in `mergekit/architecture/__init__.py`; otherwise fall back to the first.
- Make `t` optional in `mergekit/merge_methods/slerp.py`; when `t=None`, return base model weights without merging to avoid shape issues. `t` is declared with `required=False` and `default_value=None`.
- Use `ZeroEmbedding(kind="zero")` for missing tokens and create correct-shaped zero vectors in `mergekit/tokenizer/embed.py`.
- Normalize paths with `os.path.abspath(os.path.expanduser(...))` in `mergekit/io/tensor_writer.py`.

Written by Cursor Bugbot for commit e85a454.
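For completeness, the two minor fixes boil down to the following (a hedged sketch; the actual code lives inside mergekit's writer and tokenizer modules, and the dtype/shape here are illustrative):

```python
import os
import numpy as np

def normalize_out_path(path: str) -> str:
    # tensor_writer fix: expand "~" and make the path absolute so
    # later file operations don't depend on the working directory.
    return os.path.abspath(os.path.expanduser(path))

def zero_embedding(hidden_size: int) -> np.ndarray:
    # embed.py fix: a correctly shaped all-zero vector for tokens
    # absent from a donor vocabulary.
    return np.zeros(hidden_size, dtype=np.float32)
```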