@codeflash-ai codeflash-ai bot commented Dec 4, 2025

📄 12% (0.12x) speedup for _scale_bounding_box_to_original_image_shape in ultralytics/utils/callbacks/comet.py

⏱️ Runtime : 237 milliseconds → 211 milliseconds (best of 36 runs)

📝 Explanation and details

The optimizations achieve a 12% speedup by improving memory access patterns and reducing computational overhead in tensor operations:

Key Optimizations:

  1. xyxy2xywh function (35% faster): Replaced four individual tensor assignments with two vectorized operations using slice notation:

    • y[..., :2] = (x[..., :2] + x[..., 2:]) / 2 computes both center coordinates in one operation
    • y[..., 2:] = x[..., 2:] - x[..., :2] computes both width/height in one operation
    • This reduces tensor indexing overhead from 4 operations to 2 and improves memory locality
  2. xywhn2xyxy function (7% faster): Extracted repeated tensor indexing into intermediate variables:

    • c_x = x[..., 0], c_y = x[..., 1], etc. avoid redundant indexing operations
    • Each coordinate calculation now references the cached values instead of re-indexing the tensor
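The two optimizations above can be sketched as follows. This is a minimal, illustrative version modeled on the helpers in `ultralytics.utils.ops`; the exact signatures and defaults in the library may differ:

```python
import torch


def xyxy2xywh(x: torch.Tensor) -> torch.Tensor:
    """Corner boxes (x1, y1, x2, y2) -> center boxes (cx, cy, w, h)."""
    y = torch.empty_like(x)
    # Two vectorized slice assignments replace four per-channel writes
    y[..., :2] = (x[..., :2] + x[..., 2:]) / 2  # both center coords at once
    y[..., 2:] = x[..., 2:] - x[..., :2]        # both width/height at once
    return y


def xywhn2xyxy(x: torch.Tensor, w: int = 640, h: int = 640,
               padw: int = 0, padh: int = 0) -> torch.Tensor:
    """Normalized center boxes (cx, cy, w, h) -> pixel corners (x1, y1, x2, y2)."""
    y = torch.empty_like(x)
    # Cache the indexed views once instead of re-indexing x per output channel
    c_x, c_y = x[..., 0], x[..., 1]
    half_w, half_h = x[..., 2] / 2, x[..., 3] / 2
    y[..., 0] = w * (c_x - half_w) + padw  # top-left x
    y[..., 1] = h * (c_y - half_h) + padh  # top-left y
    y[..., 2] = w * (c_x + half_w) + padw  # bottom-right x
    y[..., 3] = h * (c_y + half_h) + padh  # bottom-right y
    return y
```

For example, a centered normalized box `[0.5, 0.5, 0.2, 0.2]` in a 100×100 image maps to pixel corners `[40, 40, 60, 60]`, and converting those corners back through `xyxy2xywh` yields `[50, 50, 20, 20]`.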

Why This Works:

  • Memory locality: Vectorized slice operations process contiguous memory blocks more efficiently than individual element access
  • Reduced indexing overhead: PyTorch tensor indexing has per-operation costs that are minimized by batching operations
  • CPU cache efficiency: Accessing tensor slices in larger chunks improves cache hit rates

Performance Impact:
The function is called from _format_ground_truth_annotations_for_detection in a loop over bounding boxes, making these micro-optimizations meaningful. Test results show consistent 5-12% improvements across various box configurations, with the largest gains (12.3%) occurring in batch processing scenarios with 1000+ boxes where the vectorization benefits compound.

Best suited for: Workloads with multiple bounding boxes per image and batch processing scenarios common in object detection pipelines.
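The claimed effect of the slice-based rewrite can be checked with a rough micro-benchmark. This is an illustrative sketch, not the harness Codeflash uses; absolute timings and the size of the gap will vary by machine:

```python
import time

import torch


def xyxy2xywh_per_channel(x: torch.Tensor) -> torch.Tensor:
    # Original style: four independent channel assignments
    y = torch.empty_like(x)
    y[..., 0] = (x[..., 0] + x[..., 2]) / 2
    y[..., 1] = (x[..., 1] + x[..., 3]) / 2
    y[..., 2] = x[..., 2] - x[..., 0]
    y[..., 3] = x[..., 3] - x[..., 1]
    return y


def xyxy2xywh_sliced(x: torch.Tensor) -> torch.Tensor:
    # Optimized style: two vectorized slice assignments
    y = torch.empty_like(x)
    y[..., :2] = (x[..., :2] + x[..., 2:]) / 2
    y[..., 2:] = x[..., 2:] - x[..., :2]
    return y


boxes = torch.rand(1000, 4)

# Both variants must agree before timing them
assert torch.allclose(xyxy2xywh_per_channel(boxes), xyxy2xywh_sliced(boxes))

for name, fn in [("4 assignments", xyxy2xywh_per_channel),
                 ("2 slice ops ", xyxy2xywh_sliced)]:
    start = time.perf_counter()
    for _ in range(1000):
        fn(boxes)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```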

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 2038 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
# imports
import pytest
import torch
from ultralytics.utils.callbacks.comet import _scale_bounding_box_to_original_image_shape

# unit tests

# ----------- BASIC TEST CASES -----------


def test_basic_identity_scaling():
    # No scaling, no padding, bounding box should remain the same
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])  # normalized center x/y, width/height
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = ((1.0,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 167μs -> 158μs (5.54% faster)


def test_basic_scaling_up():
    # Resize from 100x100 to 200x200, box should scale up
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (100, 100)
    original_shape = (200, 200)
    ratio_pad = None
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 165μs -> 158μs (4.34% faster)


def test_basic_scaling_down():
    # Resize from 200x200 to 100x100, box should scale down
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (200, 200)
    original_shape = (100, 100)
    ratio_pad = None
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 163μs -> 154μs (5.83% faster)


def test_basic_non_square_scaling():
    # Non-square image scaling
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (120, 80)
    original_shape = (60, 40)
    ratio_pad = None
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 168μs -> 158μs (6.30% faster)


def test_basic_with_padding():
    # Test with explicit padding
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (100, 100)
    original_shape = (80, 80)
    # Simulate ratio_pad as in YOLO letterbox: gain=1.25, pad=(10,10)
    ratio_pad = ((1.25,), (10, 10))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 163μs -> 150μs (8.94% faster)


# ----------- EDGE TEST CASES -----------


def test_edge_zero_width_height():
    # Box with zero width and height
    box = torch.tensor([0.5, 0.5, 0.0, 0.0])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = None
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 166μs -> 151μs (10.0% faster)


def test_edge_box_outside_image():
    # Box with center outside image (normalized > 1)
    box = torch.tensor([1.2, 1.2, 0.2, 0.2])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = None
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 168μs -> 155μs (8.37% faster)


def test_edge_box_negative_coordinates():
    # Box with negative normalized coordinates
    box = torch.tensor([-0.2, -0.2, 0.2, 0.2])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = None
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 169μs -> 154μs (9.79% faster)


def test_edge_minimum_size_image():
    # Minimum possible image size
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (1, 1)
    original_shape = (1, 1)
    ratio_pad = None
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 163μs -> 158μs (3.01% faster)


def test_edge_large_padding():
    # Large padding, box should be clipped
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = ((1.0,), (100, 100))  # pad larger than image
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 167μs -> 155μs (8.00% faster)


def test_edge_non_tensor_input():
    # Should raise if box is not a tensor
    box = [0.5, 0.5, 0.2, 0.2]
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = None
    with pytest.raises(AttributeError):
        _scale_bounding_box_to_original_image_shape(
            box, resized_shape, original_shape, ratio_pad
        )  # 1.74μs -> 1.76μs (1.08% slower)


def test_edge_invalid_box_shape():
    # Should raise if box shape is not (4,)
    box = torch.tensor([0.5, 0.5, 0.2])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = None
    with pytest.raises(AssertionError):
        _scale_bounding_box_to_original_image_shape(
            box, resized_shape, original_shape, ratio_pad
        )  # 4.28μs -> 4.46μs (3.99% slower)


# ----------- LARGE SCALE TEST CASES -----------


def test_large_scale_many_boxes():
    # Test with 1000 boxes in a batch (simulate vectorized call)
    boxes = torch.stack([torch.tensor([0.5, 0.5, 0.2, 0.2])] * 1000)
    resized_shape = (100, 100)
    original_shape = (200, 200)
    ratio_pad = None
    # Call function on each box, check output
    for box in boxes:
        codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
        out = codeflash_output  # 115ms -> 103ms (12.2% faster)


def test_large_scale_extreme_image_size():
    # Test with large image size, but tensor size < 100MB
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (4096, 4096)
    original_shape = (2048, 2048)
    ratio_pad = None
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 187μs -> 174μs (7.83% faster)


def test_large_scale_extreme_aspect_ratio():
    # Test with extreme aspect ratio
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (1000, 100)
    original_shape = (500, 50)
    ratio_pad = None
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 172μs -> 157μs (9.46% faster)


def test_large_scale_random_boxes():
    # Test with random boxes and shapes
    import random

    for _ in range(10):
        h0 = random.randint(50, 500)
        w0 = random.randint(50, 500)
        h1 = random.randint(50, 500)
        w1 = random.randint(50, 500)
        cx = random.uniform(0, 1)
        cy = random.uniform(0, 1)
        bw = random.uniform(0, 1)
        bh = random.uniform(0, 1)
        box = torch.tensor([cx, cy, bw, bh])
        codeflash_output = _scale_bounding_box_to_original_image_shape(box, (h1, w1), (h0, w0), None)
        out = codeflash_output  # 1.25ms -> 1.14ms (10.4% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports
import torch
from ultralytics.utils.callbacks.comet import _scale_bounding_box_to_original_image_shape

# -------------------- UNIT TESTS --------------------

# Basic Test Cases


def test_basic_center_box_no_padding():
    # Test a centered box in a 100x100 image, no padding, no scaling
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])  # normalized xywh
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = ((1.0,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 175μs -> 165μs (6.39% faster)


def test_basic_corner_box():
    # Test a box in the top left corner
    box = torch.tensor([0.1, 0.1, 0.2, 0.2])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = ((1.0,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 166μs -> 153μs (8.41% faster)


def test_basic_box_with_padding():
    # Test a box with padding applied
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (120, 100)
    original_shape = (100, 100)
    # Padding is (0, 10) (width, height)
    ratio_pad = ((1.0,), (0, 10))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 160μs -> 155μs (3.41% faster)


def test_basic_box_with_scaling():
    # Test a box with scaling applied
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (200, 200)
    original_shape = (100, 100)
    ratio_pad = ((2.0,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 165μs -> 149μs (10.8% faster)


# Edge Test Cases


def test_zero_size_box():
    # Test a box with zero width and height
    box = torch.tensor([0.5, 0.5, 0.0, 0.0])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = ((1.0,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 162μs -> 153μs (5.66% faster)


def test_box_outside_image():
    # Test a box that is partially outside the image
    box = torch.tensor([1.1, 1.1, 0.2, 0.2])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = ((1.0,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 168μs -> 157μs (7.15% faster)


def test_box_at_image_border():
    # Test a box at the right-bottom border
    box = torch.tensor([0.9, 0.9, 0.2, 0.2])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = ((1.0,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 167μs -> 154μs (8.04% faster)


def test_non_square_images():
    # Test with different aspect ratios
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (200, 100)
    original_shape = (100, 200)
    ratio_pad = ((0.5,), (0, 0))  # gain = 0.5
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 165μs -> 152μs (8.99% faster)


def test_negative_coordinates():
    # Test a box with negative coordinates (normalized)
    box = torch.tensor([-0.1, -0.1, 0.2, 0.2])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = ((1.0,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 164μs -> 154μs (6.31% faster)


def test_extreme_aspect_ratio():
    # Test with extreme aspect ratio
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (1000, 10)
    original_shape = (10, 1000)
    ratio_pad = ((0.01,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 160μs -> 147μs (8.69% faster)


# Large Scale Test Cases


def test_many_boxes():
    # Test with 1000 boxes
    boxes = torch.stack([torch.tensor([i / 1000, i / 1000, 0.1, 0.1]) for i in range(1000)])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = ((1.0,), (0, 0))
    results = []
    for box in boxes:
        codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
        out = codeflash_output  # 115ms -> 102ms (12.3% faster)
        results.append(out)
    # Each call should return a 4-element [x, y, w, h] box
    for out in results:
        assert len(out) == 4


def test_large_image():
    # Test with a large image shape
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (1000, 1000)
    original_shape = (1000, 1000)
    ratio_pad = ((1.0,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 178μs -> 168μs (5.73% faster)


def test_large_scale_with_padding_and_scaling():
    # Test with large image, padding and scaling
    box = torch.tensor([0.5, 0.5, 0.2, 0.2])
    resized_shape = (800, 600)
    original_shape = (400, 300)
    ratio_pad = ((2.0,), (10, 20))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 165μs -> 156μs (5.89% faster)


def test_all_zeros_box():
    # Test a box with all zeros
    box = torch.tensor([0.0, 0.0, 0.0, 0.0])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = ((1.0,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 158μs -> 149μs (5.95% faster)


def test_all_ones_box():
    # Test a box with all ones (should be clipped to image bounds)
    box = torch.tensor([1.0, 1.0, 1.0, 1.0])
    resized_shape = (100, 100)
    original_shape = (100, 100)
    ratio_pad = ((1.0,), (0, 0))
    codeflash_output = _scale_bounding_box_to_original_image_shape(box, resized_shape, original_shape, ratio_pad)
    out = codeflash_output  # 161μs -> 151μs (6.74% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, `git checkout codeflash/optimize-_scale_bounding_box_to_original_image_shape-mira3nws` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 4, 2025 10:14
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 4, 2025