⚡️ Speed up function _to_backend_layout by 40%
#156
📄 40% (0.40x) speedup for `_to_backend_layout` in `keras/src/backend/tensorflow/distribution_lib.py`
⏱️ Runtime: 178 microseconds → 126 microseconds (best of 250 runs)
📝 Explanation and details
The optimization achieves a 40% speedup by introducing a fast path for the common case where all tensor axes are sharded (truthy values).
Key optimizations:
- Fast-path optimization: an `if all(axes):` check detects the case where every axis is sharded. In that case, `list(axes)` is used instead of the list comprehension, which is significantly faster since it avoids per-element conditional evaluation.
- Local variable caching: `dtensor.UNSHARDED` is cached in a local variable `unsharded` to reduce attribute-lookup overhead inside the list comprehension.

Performance impact by test case: all-unsharded inputs pay a small `all()` check overhead. The optimization is particularly effective for large tensor layouts with many axes (common in distributed machine learning), where the fast path provides substantial gains. The slight regression for all-unsharded cases is outweighed by the significant improvements for sharded tensors, which are likely more common in production distributed training scenarios.
The line profiler shows the original list comprehension took 95% of execution time, now reduced to 81.2% with the fast path handling 1% of cases efficiently.
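The two optimizations described above can be sketched as follows. This is a minimal illustration, not the actual function from `keras/src/backend/tensorflow/distribution_lib.py`: the `UNSHARDED` sentinel stands in for `dtensor.UNSHARDED`, and the function names are hypothetical.

```python
UNSHARDED = "unsharded"  # placeholder for dtensor.UNSHARDED

def sharding_spec_original(axes):
    # Original shape: a per-element conditional in a list comprehension,
    # mapping falsy axes (e.g. None) to the unsharded sentinel.
    return [axis if axis else UNSHARDED for axis in axes]

def sharding_spec_optimized(axes):
    # Fast path: when every axis is sharded (truthy), a plain list()
    # copy skips the per-element conditional entirely.
    if all(axes):
        return list(axes)
    # Slow path: cache the module attribute in a local variable to
    # avoid repeated attribute lookups inside the comprehension.
    unsharded = UNSHARDED
    return [axis if axis else unsharded for axis in axes]
```

Both paths produce the same result; the fast path only changes how the common all-sharded case is computed.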
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-_to_backend_layout-mire0bby` and push.