Conversation

codeflash-ai bot commented on Dec 4, 2025

📄 15% (0.15x) speedup for AGLU.forward in ultralytics/nn/modules/activation.py

⏱️ Runtime : 205 milliseconds → 179 milliseconds (best of 29 runs)

📝 Explanation and details

The optimized code achieves a **14% speedup** by breaking down the complex nested expression into separate, more efficient operations and leveraging PyTorch's optimized tensor methods.

**Key optimizations applied** (see the sketch after this list):

1. **Efficient reciprocal and logarithm operations**: Instead of using division (`1 / lam`) and `torch.log(lam)`, the code uses `lam.reciprocal()` and `lam.log()`, which are PyTorch's optimized native methods and avoid the overhead of the generic operators.

2. **Intermediate result reuse**: The original code computed `torch.log(lam)` twice within the nested expression. The optimized version computes `log_lam` once and reuses it, eliminating redundant computation.

3. **Operation decomposition**: Breaking the complex nested expression into discrete steps (`kappa_x`, `splus`, `exp_input`) allows PyTorch to optimize each operation individually and potentially enables better memory access patterns.

4. **In-place subtraction**: Using `kappa_x.sub_(log_lam)` modifies the tensor in-place when safe to do so, potentially reducing memory allocations.
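
For concreteness, here is a minimal sketch of what the decomposed forward pass could look like. It assumes the usual layout of Ultralytics' `AGLU` module (a `Softplus(beta=-1.0)` activation and learnable `lambd`/`kappa` parameters, with `lambd` clamped at 0.0001, which the tests below also rely on); the exact optimized diff is not reproduced in this comment, so treat this as an illustration rather than the merged code.

```python
import torch
import torch.nn as nn


class AGLU(nn.Module):
    """Unified activation function (AGLU) -- illustrative sketch, not the merged diff."""

    def __init__(self, device=None, dtype=None) -> None:
        super().__init__()
        self.act = nn.Softplus(beta=-1.0)
        self.lambd = nn.Parameter(nn.init.uniform_(torch.empty(1, device=device, dtype=dtype)))
        self.kappa = nn.Parameter(nn.init.uniform_(torch.empty(1, device=device, dtype=dtype)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lam = torch.clamp(self.lambd, min=0.0001)
        # Original single-line form:
        #   return torch.exp((1 / lam) * self.act((self.kappa * x) - torch.log(lam)))
        log_lam = lam.log()                   # (2) compute log(lam) once and reuse it
        kappa_x = self.kappa * x              # (3) discrete step; fresh tensor, so in-place is safe
        kappa_x = kappa_x.sub_(log_lam)       # (4) in-place subtraction avoids an extra allocation
        splus = self.act(kappa_x)             # softplus with beta = -1
        exp_input = lam.reciprocal() * splus  # (1) reciprocal() instead of 1 / lam
        return torch.exp(exp_input)
```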

**Performance impact:** The line profiler shows the original single-line computation took 99.6% of execution time (217 ms), while the optimized version distributes this across multiple optimized operations totaling ~99.3% (185 ms). The optimizations are particularly effective for:

- Small to medium tensors (15-36% speedup in many test cases)
- Matrix operations (24-36% improvement for 2D tensors)
- Edge cases with extreme parameter values (20-33% faster)

**Workload benefits:** Since AGLU is an activation function in a neural network module, these optimizations compound across multiple forward passes during training/inference, making the 14% per-call improvement significant for model performance.
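
To sanity-check the per-call numbers locally, a small comparison harness along these lines could be used; `aglu_original` and `aglu_optimized` are hypothetical names for instances of the pre- and post-optimization modules, and `torch.utils.benchmark` takes care of the timing boilerplate.

```python
import torch
from torch.utils import benchmark


def compare_forward(aglu_original, aglu_optimized, shape=(300, 300), runs=200):
    """Time both AGLU variants on the same random input and print the measurements."""
    x = torch.randn(*shape)
    for label, module in (("original", aglu_original), ("optimized", aglu_optimized)):
        timer = benchmark.Timer(stmt="m(x)", globals={"m": module, "x": x})
        print(label, timer.timeit(runs))
```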

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 86 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest
import torch
from ultralytics.nn.modules.activation import AGLU

# unit tests

# -------- BASIC TEST CASES --------


def test_forward_basic_scalar():
    """Test forward pass with a single scalar input tensor."""
    aglu = AGLU()
    x = torch.tensor(1.0)
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 82.9μs -> 72.6μs (14.2% faster)


def test_forward_basic_vector():
    """Test forward pass with a small 1D input tensor."""
    aglu = AGLU()
    x = torch.tensor([0.0, 1.0, -1.0])
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 63.7μs -> 55.4μs (15.0% faster)


def test_forward_basic_matrix():
    """Test forward pass with a small 2D input tensor."""
    aglu = AGLU()
    x = torch.tensor([[0.0, 1.0], [2.0, -2.0]])
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 62.9μs -> 50.5μs (24.4% faster)


def test_forward_basic_grad():
    """Test that output is differentiable (grad can be computed)."""
    aglu = AGLU()
    x = torch.tensor([0.5, -0.5], requires_grad=True)
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 56.2μs -> 46.9μs (19.7% faster)
    # Should be able to compute gradients
    out.sum().backward()


# -------- EDGE TEST CASES --------


def test_forward_zero_input():
    """Test forward pass with all zeros input."""
    aglu = AGLU()
    x = torch.zeros(5)
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 63.5μs -> 54.1μs (17.4% faster)


def test_forward_large_positive_input():
    """Test forward pass with large positive values."""
    aglu = AGLU()
    x = torch.tensor([1e2, 1e3, 1e4])
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 60.5μs -> 50.3μs (20.1% faster)


def test_forward_large_negative_input():
    """Test forward pass with large negative values."""
    aglu = AGLU()
    x = torch.tensor([-1e2, -1e3, -1e4])
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 58.5μs -> 49.6μs (18.0% faster)


def test_forward_inf_nan_input():
    """Test forward pass with inf and nan input values."""
    aglu = AGLU()
    x = torch.tensor([float("inf"), float("-inf"), float("nan")])
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 55.1μs -> 50.3μs (9.52% faster)


def test_forward_extreme_lambda_kappa():
    """Test with extreme values for lambda and kappa parameters."""
    aglu = AGLU()
    # Set lambda to a very small value (should be clamped to 0.0001)
    aglu.lambd.data.fill_(1e-8)
    aglu.kappa.data.fill_(10.0)
    x = torch.tensor([1.0, -1.0])
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 53.5μs -> 45.7μs (17.2% faster)


def test_forward_negative_lambda():
    """Test with negative lambda parameter (should be clamped to 0.0001)."""
    aglu = AGLU()
    aglu.lambd.data.fill_(-1.0)
    x = torch.tensor([0.5])
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 58.0μs -> 47.5μs (22.1% faster)


def test_forward_empty_input():
    """Test forward pass with empty tensor input."""
    aglu = AGLU()
    x = torch.tensor([])
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 51.7μs -> 44.9μs (15.1% faster)


def test_forward_high_dimensional_input():
    """Test with high-dimensional tensor input (e.g., 4D)."""
    aglu = AGLU()
    x = torch.ones((2, 3, 4, 5))
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 60.9μs -> 49.4μs (23.5% faster)


# -------- LARGE SCALE TEST CASES --------


def test_forward_large_batch():
    """Test forward pass with a large batch of inputs (1D tensor, <1000 elements)."""
    aglu = AGLU()
    x = torch.randn(999)
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 62.4μs -> 51.5μs (21.0% faster)


def test_forward_large_matrix():
    """Test forward pass with a large 2D tensor (<1000x1000 elements, but keep under 100MB)."""
    aglu = AGLU()
    # 300 x 300 floats is about 0.36MB, well under limit
    x = torch.randn(300, 300)
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 244μs -> 221μs (10.5% faster)


def test_forward_large_3d_tensor():
    """Test forward pass with a large 3D tensor."""
    aglu = AGLU()
    x = torch.randn(20, 20, 20)  # 8000 elements
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 88.0μs -> 76.2μs (15.5% faster)


def test_forward_performance_large_input():
    """Test that forward pass executes quickly for large input (timing test)."""
    import time

    aglu = AGLU()
    x = torch.randn(500, 50)
    start = time.time()
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 114μs -> 99.3μs (14.8% faster)
    elapsed = time.time() - start


# -------- MISCELLANEOUS TESTS --------


def test_forward_dtype_float32():
    """Test forward pass with float32 input."""
    aglu = AGLU(dtype=torch.float32)
    x = torch.randn(10, dtype=torch.float32)
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 60.1μs -> 46.2μs (30.1% faster)


def test_forward_dtype_float64():
    """Test forward pass with float64 input."""
    aglu = AGLU(dtype=torch.float64)
    x = torch.randn(10, dtype=torch.float64)
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 57.7μs -> 46.9μs (23.1% faster)


def test_forward_device_cpu():
    """Test forward pass on CPU."""
    aglu = AGLU(device="cpu")
    x = torch.randn(5)
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 50.1μs -> 41.4μs (20.9% faster)


def test_forward_non_contiguous_input():
    """Test forward pass with non-contiguous tensor input."""
    aglu = AGLU()
    x = torch.randn(10, 10)
    x_t = x.t()  # Transpose makes tensor non-contiguous
    codeflash_output = aglu.forward(x_t)
    out = codeflash_output  # 69.1μs -> 55.8μs (24.0% faster)


def test_forward_requires_grad_output():
    """Test that output requires grad if input requires grad."""
    aglu = AGLU()
    x = torch.randn(5, requires_grad=True)
    codeflash_output = aglu.forward(x)
    out = codeflash_output  # 50.3μs -> 42.2μs (19.3% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import torch
from ultralytics.nn.modules.activation import AGLU

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------


def test_forward_scalar_input():
    """Test forward with a scalar input tensor."""
    aglu = AGLU()
    x = torch.tensor(1.0)
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 56.2μs -> 45.0μs (24.9% faster)


def test_forward_vector_input():
    """Test forward with a 1D tensor input."""
    aglu = AGLU()
    x = torch.tensor([0.0, 1.0, -1.0, 2.0])
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 55.0μs -> 46.2μs (19.0% faster)


def test_forward_matrix_input():
    """Test forward with a 2D tensor input."""
    aglu = AGLU()
    x = torch.tensor([[0.0, 1.0], [-1.0, 2.0]])
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 60.6μs -> 44.4μs (36.5% faster)


def test_forward_requires_grad():
    """Test that gradients can be computed through the forward pass."""
    aglu = AGLU()
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 59.7μs -> 44.1μs (35.4% faster)
    y.sum().backward()


def test_forward_different_device_dtype():
    """Test forward on a different device and dtype if available."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    dtype = torch.float64
    aglu = AGLU(device=device, dtype=dtype)
    x = torch.randn(3, 3, device=device, dtype=dtype)
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 65.2μs -> 55.4μs (17.6% faster)


# ---------------------------
# Edge Test Cases
# ---------------------------


def test_forward_zero_input():
    """Test forward with all-zero input."""
    aglu = AGLU()
    x = torch.zeros(10)
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 58.7μs -> 47.7μs (23.1% faster)


def test_forward_large_positive_input():
    """Test forward with large positive values."""
    aglu = AGLU()
    x = torch.full((5,), 100.0)
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 60.0μs -> 46.1μs (30.3% faster)


def test_forward_large_negative_input():
    """Test forward with large negative values."""
    aglu = AGLU()
    x = torch.full((5,), -100.0)
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 54.3μs -> 45.1μs (20.4% faster)


def test_forward_extremely_small_lambda():
    """Test forward when lambda is (almost) zero, should be clamped to 0.0001."""
    aglu = AGLU()
    aglu.lambd.data.fill_(-1e6)  # Set to a very negative value
    x = torch.tensor([1.0])
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 55.7μs -> 43.0μs (29.4% faster)
    # Check that lambda was clamped
    lam = torch.clamp(aglu.lambd, min=0.0001)


def test_forward_extremely_large_lambda():
    """Test forward when lambda is very large."""
    aglu = AGLU()
    aglu.lambd.data.fill_(1e6)
    x = torch.tensor([1.0])
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 57.7μs -> 45.8μs (26.0% faster)


def test_forward_extremely_large_kappa():
    """Test forward with very large kappa parameter."""
    aglu = AGLU()
    aglu.kappa.data.fill_(1e6)
    x = torch.tensor([1.0])
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 52.2μs -> 43.4μs (20.2% faster)


def test_forward_extremely_negative_kappa():
    """Test forward with very negative kappa parameter."""
    aglu = AGLU()
    aglu.kappa.data.fill_(-1e6)
    x = torch.tensor([1.0])
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 55.8μs -> 42.3μs (32.0% faster)


def test_forward_nan_input():
    """Test forward with NaN input."""
    aglu = AGLU()
    x = torch.tensor([float("nan")])
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 55.0μs -> 47.0μs (16.9% faster)


def test_forward_inf_input():
    """Test forward with inf input."""
    aglu = AGLU()
    x = torch.tensor([float("inf")])
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 58.7μs -> 43.9μs (33.8% faster)


def test_forward_negative_inf_input():
    """Test forward with -inf input."""
    aglu = AGLU()
    x = torch.tensor([-float("inf")])
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 58.5μs -> 43.8μs (33.8% faster)


def test_forward_empty_tensor():
    """Test forward with an empty tensor."""
    aglu = AGLU()
    x = torch.tensor([])
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 53.5μs -> 41.9μs (27.7% faster)


def test_forward_high_dimensional_tensor():
    """Test forward with a high-dimensional tensor."""
    aglu = AGLU()
    x = torch.randn(2, 3, 4, 5)
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 60.1μs -> 49.9μs (20.4% faster)


# ---------------------------
# Large Scale Test Cases
# ---------------------------


def test_forward_large_batch():
    """Test forward with a large batch of data (1D tensor)."""
    aglu = AGLU()
    x = torch.randn(1000)  # 1000 elements
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 61.8μs -> 50.6μs (22.1% faster)


def test_forward_large_matrix():
    """Test forward with a large 2D tensor."""
    aglu = AGLU()
    x = torch.randn(100, 100)  # 10,000 elements, ~40KB
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 85.2μs -> 73.7μs (15.5% faster)


def test_forward_large_3d_tensor():
    """Test forward with a large 3D tensor."""
    aglu = AGLU()
    x = torch.randn(10, 10, 10)  # 1,000 elements
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 66.6μs -> 50.2μs (32.8% faster)


def test_forward_large_tensor_memory_limit():
    """Test forward with tensor close to 100MB memory limit."""
    aglu = AGLU()
    # Each float32 is 4 bytes. 25,000,000 elements = 100MB
    x = torch.randn(25_000_000)
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 199ms -> 173ms (14.7% faster)


def test_forward_backward_large_tensor():
    """Test forward and backward on a large tensor to check for memory leaks or slowdowns."""
    aglu = AGLU()
    x = torch.randn(1_000_000, requires_grad=True)
    codeflash_output = aglu.forward(x)
    y = codeflash_output  # 2.95ms -> 2.86ms (3.19% faster)
    loss = y.sum()
    loss.backward()


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-AGLU.forward-mirfznce` and push.

codeflash-ai bot requested a review from mashraf-222 on Dec 4, 2025, 12:59
codeflash-ai bot added labels: ⚡️ codeflash (Optimization PR opened by Codeflash AI), 🎯 Quality: High (Optimization Quality according to Codeflash) on Dec 4, 2025