⚡️ Speed up method AGLU.forward by 15%
#54
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
📄 15% (0.15x) speedup for
AGLU.forwardinultralytics/nn/modules/activation.py⏱️ Runtime :
205 milliseconds→179 milliseconds(best of29runs)📝 Explanation and details
The optimized code achieves a 14% speedup by breaking down the complex nested expression into separate, more efficient operations and leveraging PyTorch's optimized tensor methods.
Key optimizations applied:
Efficient reciprocal and logarithm operations: Instead of using division (
1 / lam) andtorch.log(lam), the code useslam.reciprocal()andlam.log()which are PyTorch's optimized native methods that avoid overhead from generic operations.Intermediate result reuse: The original code computed
torch.log(lam)twice within the nested expression. The optimized version computeslog_lamonce and reuses it, eliminating redundant computation.Operation decomposition: Breaking the complex nested expression into discrete steps (
kappa_x,splus,exp_input) allows PyTorch to optimize each operation individually and potentially enables better memory access patterns.In-place subtraction: Using
kappa_x.sub_(log_lam)modifies the tensor in-place when safe to do so, potentially reducing memory allocations.Performance impact: The line profiler shows the original single-line computation took 99.6% of execution time (217ms), while the optimized version distributes this across multiple optimized operations totaling ~99.3% (185ms). The optimizations are particularly effective for:
Workload benefits: Since AGLU is an activation function in a neural network module, these optimizations will compound across multiple forward passes during training/inference, making the 14% per-call improvement significant for model performance.
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes
git checkout codeflash/optimize-AGLU.forward-mirfznceand push.