[OPTIMIZER] add weight split function for optimizer #415
Conversation
Code Review
This pull request introduces a mechanism to split model parameters into sub-components for optimizer processing, specifically targeting QKV and MLP gate tensors. It adds utility functions to associate parameters with splitting logic and updates the optimizer container to apply these splits during initialization.

However, the review identifies several critical issues:
- The splitting logic is applied to the incorrect parameter in the attention module.
- A fundamental architectural flaw: standard optimizers cannot update parameter slices, because gradients are only populated on the original leaf parameters.
- An assertion incorrectly compares class types to strings.
- Type hint mismatches in the utility functions.
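The leaf-parameter flaw can be illustrated with a minimal PyTorch sketch (hypothetical, not the PR's code, assuming the split is implemented as tensor views): a slice of a fused parameter is a non-leaf view, so gradients accumulate only on the original leaf parameter, and standard optimizers reject the slice outright.

```python
import torch

# Stand-in for a fused QKV weight; the names are illustrative only.
qkv = torch.nn.Parameter(torch.randn(6, 4))
q_part = qkv[:2]  # "split" as a view: this is a non-leaf tensor

qkv.sum().backward()

print(qkv.is_leaf, q_part.is_leaf)   # True False
print(qkv.grad is not None)          # True: gradient lands on the original leaf

try:
    # Standard optimizers refuse non-leaf tensors, so the slice
    # can never be updated this way.
    torch.optim.SGD([q_part], lr=0.1)
except ValueError as e:
    print("optimizer rejected slice:", e)
```

This is why the split tensors would need to be registered as independent leaf parameters (or handled inside a custom optimizer step) rather than passed as views of the original weight.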
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
No description provided.