
Conversation


@codeflash-ai codeflash-ai bot commented Dec 4, 2025

📄 554% (5.54x) speedup for compute_pooling_output_shape in keras/src/ops/operation_utils.py

⏱️ Runtime : 726 microseconds → 111 microseconds (best of 22 runs)

📝 Explanation and details

The optimized code achieves a 554% speedup by introducing a fast path for integer-only dimensions that avoids NumPy overhead entirely.

Key optimizations:

  1. Conditional NumPy usage: Instead of always converting inputs to NumPy arrays, the code first checks whether any spatial dimension is None. When all dimensions are known integers (the common case), it performs the calculation with native Python integer arithmetic, which is much faster for such small arrays (see the sketch after this list).

  2. Eliminated unnecessary array conversions: The original code created np.array(input_shape), np.array(pool_size), and performed expensive NumPy operations like np.floor() and division even for simple cases. The optimized version uses // (integer division) and basic arithmetic when possible.

  3. Reduced memory allocations: Changed input_shape_origin = list(input_shape) to input_shape_origin = tuple(input_shape) to avoid an unnecessary list creation, and only creates NumPy arrays when None dimensions are present.
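
To make the two paths concrete, here is a minimal sketch of that structure. It is not the Keras source: the helper name `_pooling_output_shape_sketch` and the exact handling of `None` dimensions are illustrative assumptions; only the `valid`/`same` shape formulas and the fast-path/NumPy split follow the description above.

```python
import numpy as np


def _pooling_output_shape_sketch(
    input_shape, pool_size, strides=None, padding="valid", data_format="channels_last"
):
    # Hypothetical re-creation of the optimized structure described above.
    strides = pool_size if strides is None else strides
    input_shape_origin = tuple(input_shape)

    # Separate the spatial dims from the batch/channel dims.
    if data_format == "channels_last":
        spatial = input_shape_origin[1:-1]
    else:
        spatial = input_shape_origin[2:]

    if all(dim is not None for dim in spatial):
        # Fast path: plain Python integer arithmetic, no NumPy arrays.
        if padding == "valid":
            out_spatial = tuple(
                (d - p) // s + 1 for d, p, s in zip(spatial, pool_size, strides)
            )
        else:  # "same": ceil(d / s) == (d - 1) // s + 1 for positive ints
            out_spatial = tuple((d - 1) // s + 1 for d, s in zip(spatial, strides))
    else:
        # Slow path: None dims present, fall back to NumPy intp arrays.
        spatial_arr = np.array(
            [-1 if d is None else d for d in spatial], dtype=np.intp
        )
        pool_arr = np.array(pool_size, dtype=np.intp)
        strides_arr = np.array(strides, dtype=np.intp)
        if padding == "valid":
            out = np.floor_divide(spatial_arr - pool_arr, strides_arr) + 1
        else:
            out = np.floor_divide(spatial_arr - 1, strides_arr) + 1
        # None dims are carried through as None; others become plain ints.
        out_spatial = tuple(
            None if d is None else int(o) for d, o in zip(spatial, out)
        )

    if data_format == "channels_last":
        return (input_shape_origin[0],) + out_spatial + (input_shape_origin[-1],)
    return input_shape_origin[:2] + out_spatial
```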

Performance impact by use case:

  • Standard pooling operations (no None dimensions): Take the fast integer-only path, showing 4-7x speedups across test cases
  • Dynamic shape scenarios (with None dimensions): Still use NumPy but with more efficient array creation using dtype=np.intp and np.floor_divide

Hot path significance: This function is called from pooling layers' compute_output_shape() and ops' compute_output_spec() methods, making it critical for model compilation and shape inference. The optimization particularly benefits common CNN architectures where pooling shapes are typically known at compile time, allowing the fast integer-only path to be used consistently.
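
For context, shape inference in a typical CNN hits this helper repeatedly at compile time. A minimal illustration is below; the concrete layer and input shape are just an example:

```python
from keras import layers

pool = layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))
# compute_output_shape() on pooling layers calls compute_pooling_output_shape()
print(pool.compute_output_shape((None, 224, 224, 3)))
# -> (None, 112, 112, 3)
```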

The optimizations maintain identical behavior while dramatically reducing computational overhead for the most frequent use cases.
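
As a quick sanity check on the shape arithmetic, the two padding formulas quoted in the test comments below can be exercised directly:

```python
from keras.src.ops.operation_utils import compute_pooling_output_shape

# "valid" padding: floor((dim - pool) / stride) + 1 per spatial axis
print(compute_pooling_output_shape((1, 5, 5, 1), (2, 2), strides=(2, 2)))
# expected: (1, 2, 2, 1)

# "same" padding: floor((dim - 1) / stride) + 1 per spatial axis
print(compute_pooling_output_shape((1, 5, 5, 1), (2, 2), strides=(2, 2), padding="same"))
# expected: (1, 3, 3, 1)
```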

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 28 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 82.1% |
🌀 Generated Regression Tests and Runtime
import numpy as np
# imports
import pytest
from keras.src.ops.operation_utils import compute_pooling_output_shape

# -------------------------
# unit tests
# -------------------------

# BASIC TEST CASES

def test_basic_strided_pooling():
    # 4x4 input, 2x2 pool, stride 1, channels_last
    codeflash_output = compute_pooling_output_shape((1, 4, 4, 1), (2, 2), strides=(1, 1)) # 31.5μs -> 4.31μs (630% faster)

def test_basic_strided_pooling_channels_first():
    # 4x4 input, 2x2 pool, stride 1, channels_first
    codeflash_output = compute_pooling_output_shape((1, 1, 4, 4), (2, 2), strides=(1, 1), data_format="channels_first") # 19.9μs -> 3.76μs (428% faster)

def test_strides_different_from_pool_size():
    # 6x6 input, 2x2 pool, strides 3, channels_last
    codeflash_output = compute_pooling_output_shape((1, 6, 6, 1), (2, 2), strides=(3, 3)) # 31.9μs -> 4.23μs (654% faster)

# EDGE TEST CASES

def test_non_integer_output_shape():
    # 5x5 input, 2x2 pool, stride 2, valid padding, output shape should be floor((5-2)/2)+1 = 2
    codeflash_output = compute_pooling_output_shape((1, 5, 5, 1), (2, 2), strides=(2, 2)) # 32.2μs -> 4.39μs (633% faster)

def test_non_integer_output_shape_same():
    # 5x5 input, 2x2 pool, stride 2, same padding, output shape should be floor((5-1)/2)+1 = 3
    codeflash_output = compute_pooling_output_shape((1, 5, 5, 1), (2, 2), strides=(2, 2), padding="same") # 19.9μs -> 3.73μs (433% faster)

def test_1d_pooling_channels_last():
    # 1D pooling, input (batch, steps, channels)
    codeflash_output = compute_pooling_output_shape((2, 10, 3), (3,), strides=(2,)) # 19.1μs -> 3.61μs (430% faster)

def test_1d_pooling_channels_first():
    # 1D pooling, input (batch, channels, steps)
    codeflash_output = compute_pooling_output_shape((2, 3, 10), (3,), strides=(2,), data_format="channels_first") # 19.9μs -> 3.46μs (475% faster)

def test_strides_none_defaults_to_pool_size():
    # If strides is None, should default to pool_size
    codeflash_output = compute_pooling_output_shape((1, 6, 6, 1), (2, 2), strides=None) # 32.4μs -> 4.31μs (652% faster)

def test_pool_size_one_stride_one():
    # Pool size 1, stride 1, should return input spatial shape
    codeflash_output = compute_pooling_output_shape((1, 5, 5, 1), (1, 1), strides=(1, 1)) # 21.0μs -> 3.80μs (452% faster)

def test_large_stride():
    # Large stride, should reduce output shape significantly
    batch = 8
    height = 100
    width = 100
    channels = 3
    pool_size = (3, 3)
    strides = (10, 10)
    expected_shape = (batch, (height - 3) // 10 + 1, (width - 3) // 10 + 1, channels)
    codeflash_output = compute_pooling_output_shape((batch, height, width, channels), pool_size, strides=strides) # 31.8μs -> 4.24μs (650% faster)
    assert codeflash_output == expected_shape

def test_large_1d_pooling():
    # Large 1D input
    batch = 16
    steps = 999
    channels = 8
    pool_size = (3,)
    strides = (2,)
    expected_shape = (batch, (steps - 3) // 2 + 1, channels)
    codeflash_output = compute_pooling_output_shape((batch, steps, channels), pool_size, strides=strides) # 31.3μs -> 4.12μs (660% faster)
    assert codeflash_output == expected_shape
import numpy as np
# imports
import pytest
from keras.src.ops.operation_utils import compute_pooling_output_shape

# unit tests

# --------------------------
# Basic Test Cases
# --------------------------

def test_basic_strided_pooling():
    # Stride 1, pool 2x2, should slide over 3 positions
    codeflash_output = compute_pooling_output_shape((1, 4, 4, 1), (2, 2), strides=(1, 1)); out = codeflash_output # 31.7μs -> 4.42μs (616% faster)

def test_basic_non_square_pool():
    # Non-square pool, pool 2x3, stride 2x2
    codeflash_output = compute_pooling_output_shape((1, 6, 7, 1), (2, 3), strides=(2, 2)); out = codeflash_output # 32.1μs -> 4.33μs (642% faster)

def test_basic_strides_larger_than_pool():
    # Strides larger than pool size
    codeflash_output = compute_pooling_output_shape((1, 10, 10, 1), (2, 2), strides=(5, 5)); out = codeflash_output # 20.6μs -> 3.86μs (433% faster)

def test_basic_1d_pooling():
    # 1D pooling, e.g. (batch, steps, channels)
    codeflash_output = compute_pooling_output_shape((8, 16, 3), (2,), strides=(2,)); out = codeflash_output # 18.7μs -> 3.53μs (428% faster)

def test_edge_strides_none_explicit():
    # Explicitly passing strides=None should default to pool_size
    codeflash_output = compute_pooling_output_shape((1, 6, 6, 1), (2, 2), strides=None); out = codeflash_output # 32.1μs -> 4.47μs (618% faster)

def test_edge_uneven_division_valid():
    # Pool and stride that do not evenly divide input
    codeflash_output = compute_pooling_output_shape((1, 7, 7, 1), (3, 3), strides=(2, 2)); out = codeflash_output # 21.1μs -> 3.85μs (448% faster)

def test_edge_uneven_division_same():
    # SAME padding with uneven division
    codeflash_output = compute_pooling_output_shape((1, 7, 7, 1), (3, 3), strides=(2, 2), padding="same"); out = codeflash_output # 18.8μs -> 3.78μs (396% faster)

def test_edge_stride_one():
    # Stride 1 should slide over all positions
    codeflash_output = compute_pooling_output_shape((1, 5, 5, 1), (2, 2), strides=(1, 1)); out = codeflash_output # 31.5μs -> 4.19μs (651% faster)

# --------------------------
# Large Scale Test Cases
# --------------------------

def test_large_spatial_dims():
    # Large spatial dims, check output shape
    codeflash_output = compute_pooling_output_shape((2, 512, 512, 1), (3, 3), strides=(2, 2)); out = codeflash_output # 31.8μs -> 4.44μs (617% faster)

def test_large_strides():
    # Large stride, should reduce output to 1
    codeflash_output = compute_pooling_output_shape((1, 64, 64, 1), (3, 3), strides=(64, 64)); out = codeflash_output # 31.6μs -> 4.24μs (645% faster)

def test_large_non_square():
    # Large, non-square spatial dims and pool
    codeflash_output = compute_pooling_output_shape((10, 128, 256, 3), (5, 7), strides=(4, 8)); out = codeflash_output # 20.3μs -> 3.68μs (452% faster)

def test_large_channels_first():
    # Large, channels_first
    codeflash_output = compute_pooling_output_shape((8, 512, 64, 64), (2, 2), data_format="channels_first"); out = codeflash_output

# --------------------------
# Additional Robustness Tests
# --------------------------

@pytest.mark.parametrize(
    "input_shape,pool_size,strides,padding,data_format,expected",
    [
        # 1D, 2D, 3D, channels_last, channels_first, valid/same
        ((10, 20, 3), (2,), (2,), "valid", "channels_last", (10, 10, 3)),
        ((10, 3, 20), (2,), (2,), "valid", "channels_first", (10, 3, 10)),
        ((2, 7, 8, 9, 4), (2, 3, 4), (2, 3, 4), "valid", "channels_last", (2, 3, 2, 2, 4)),
        ((2, 4, 7, 8, 9), (2, 3, 4), (2, 3, 4), "valid", "channels_first", (2, 4, 3, 2, 2)),
        ((1, 5, 5, 1), (2, 2), (1, 1), "same", "channels_last", (1, 5, 5, 1)),
        ((1, 1, 5, 5), (2, 2), (1, 1), "same", "channels_first", (1, 1, 5, 5)),
    ]
)
def test_parametrized_various_cases(input_shape, pool_size, strides, padding, data_format, expected):
    # Parametrized test for various input scenarios
    codeflash_output = compute_pooling_output_shape(input_shape, pool_size, strides, padding, data_format); out = codeflash_output # 145μs -> 22.3μs (553% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-compute_pooling_output_shape-mireyo77` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 4, 2025 12:30
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 4, 2025
