Left matmul 100x performance improvements #44
Merged
Transurgeon reviewed on Feb 13, 2026
Author (Collaborator):
I tested that it works on the Python side and added a few extra tests there. I also addressed your comments @Transurgeon, so I'm merging this.
Transurgeon added a commit that referenced this pull request on Feb 13, 2026
Sync parameter-support with main's left matmul 100x perf improvements (PR #44) and right matmul refactor (PR #46). Simplify param matmul to store only the small A matrix instead of the block-diagonal matrix; the block_left_multiply_* functions handle the rest. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This PR speeds up left matrix multiplication by roughly 100x (!) by avoiding explicit construction of the Kronecker product. Instead of treating the operation as a generic sparse matrix–matrix multiply, we use specialized logic that exploits the block/Kronecker structure. The initialization that took 15 seconds on one of Max's problems now takes 0.17 seconds.
This will make the parameter code for left matmul much simpler, since you only need to update A when refreshing a parameter @Transurgeon.
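For readers who want the idea in a few lines, here is a minimal dense NumPy sketch of the structure being exploited. It assumes column-major vectorization, so that vec(A X) = (I_n ⊗ A) vec(X) and the Jacobian of a left matmul is kron(I_n, A) times the child's Jacobian. The helper name `apply_block_diag_left` and all shapes are hypothetical and chosen for illustration; this is not the PR's `block_left_multiply_*` code, which presumably works on sparse data.

```python
import numpy as np


def apply_block_diag_left(A, J, n_blocks):
    """Return kron(I_n, A) @ J without building the Kronecker product.

    Hypothetical helper for illustration only.
    A: (p, m) dense matrix.
    J: (n_blocks * m, k) matrix, e.g. a Jacobian of vec(X) in k variables.
    """
    p, m = A.shape
    rows, k = J.shape
    if rows != n_blocks * m:
        raise ValueError("J must have n_blocks * A.shape[1] rows")
    # View J as n_blocks stacked (m, k) blocks, one per column of X.
    blocks = J.reshape(n_blocks, m, k)
    # np.matmul broadcasts A over the leading axis: shape (n_blocks, p, k).
    return (A @ blocks).reshape(n_blocks * p, k)


rng = np.random.default_rng(0)
p, m, n, k = 3, 4, 5, 6
A = rng.standard_normal((p, m))
J = rng.standard_normal((n * m, k))

# Reference result: materializes the full (n*p, n*m) block-diagonal matrix.
explicit = np.kron(np.eye(n), A) @ J
# Structured result: only the small A is ever stored.
structured = apply_block_diag_left(A, J, n)
assert np.allclose(explicit, structured)

# Refreshing a parameter only means swapping in a new small A.
A_new = rng.standard_normal((p, m))
refreshed = apply_block_diag_left(A_new, J, n)
```

In this sketch the only state tied to the parameter is the small `A`, which is why a parameter refresh reduces to replacing `A` rather than rebuilding a block-diagonal matrix.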
I have not yet checked that our Python tests in DNLP pass with this code, so let's hold off on merging until I have.
We should also do this refactor for right matmul, but that's for another day. I wonder if Claude can code it up by mimicking my implementation of left_matmul? @Transurgeon