@codeflash-ai codeflash-ai bot commented Dec 4, 2025

📄 13% (0.13x) speedup for PoseValidator._prepare_pred in ultralytics/models/yolo/pose/val.py

⏱️ Runtime: 3.09 milliseconds → 2.75 milliseconds (best of 88 runs)

📝 Explanation and details

The optimized code achieves a speedup of roughly 12-13% (3.09 ms to 2.75 ms) through targeted PyTorch tensor-operation changes, primarily in the coordinate scaling functions that _prepare_pred calls during YOLO pose validation.

Key Optimizations Applied:

  1. In-place Tensor Operations in scale_coords: Replaced augmented assignments on indexed slices (-=, /=) with explicit in-place methods (.sub_(), .div_()) called on the slice views. Both forms modify the tensor in place, but an augmented assignment on a slice also performs a __setitem__ write-back afterwards, which the explicit methods skip.

  2. Optimized Padding Calculations: In scale_boxes, the padding calculation was restructured to separate width/height computations (pad_w, pad_h) and create the tuple once, reducing redundant arithmetic operations.

  3. Streamlined View Operations: In PoseValidator._prepare_pred, replaced len(predn) with predn.size(0) for the tensor view operation, which is a more direct tensor method call that avoids Python overhead.
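The first and third points can be illustrated with a small sketch. It is not the PR's diff: the tensor shape and the pad and gain values are made up for illustration, and it only shows that the two spellings produce the same numbers.

```python
import torch

# Two copies of a hypothetical keypoint tensor: (detections, keypoints, xy).
coords_a = torch.arange(12, dtype=torch.float32).reshape(2, 3, 2)
coords_b = coords_a.clone()
pad, gain = (4.0, 6.0), 2.0  # illustrative values, not taken from the PR

# Augmented assignment on an indexed slice: the arithmetic is in place,
# but Python then writes the result back through __setitem__.
coords_a[..., 0] -= pad[0]
coords_a[..., 1] -= pad[1]
coords_a[..., 0] /= gain
coords_a[..., 1] /= gain

# Explicit in-place methods on the view: same arithmetic, no write-back step.
coords_b[..., 0].sub_(pad[0]).div_(gain)
coords_b[..., 1].sub_(pad[1]).div_(gain)

same_result = torch.equal(coords_a, coords_b)

# For a tensor with at least one dimension, len(t) and t.size(0) agree.
same_length = len(coords_b) == coords_b.size(0)
print(same_result, same_length)
```

Because indexing with `[..., 0]` returns a view, the in-place methods update the underlying tensor directly, which is why the results match.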

Performance Impact:
The line profiler shows the most significant gains in scale_coords where in-place operations reduce execution time from ~1.1ms to ~1.0ms. The coordinate subtraction and division operations show 15-30% improvements in per-hit timing, which compounds across the many tensor operations during validation.
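A rough way to reproduce this kind of comparison locally is a best-of-N micro-benchmark. The sketch below is an assumption about how one might measure it, not the profiler setup Codeflash used; the function names, tensor shape, and repeat counts are hypothetical, and absolute timings will vary by machine.

```python
import timeit

import torch


def scale_augmented(coords, pad, gain):
    # Augmented assignment on indexed slices.
    coords[..., 0] -= pad[0]
    coords[..., 1] -= pad[1]
    coords[..., 0] /= gain
    coords[..., 1] /= gain
    return coords


def scale_inplace(coords, pad, gain):
    # Explicit in-place methods on the slice views.
    coords[..., 0].sub_(pad[0]).div_(gain)
    coords[..., 1].sub_(pad[1]).div_(gain)
    return coords


base = torch.rand(500, 17, 2)  # hypothetical: 500 detections, 17 keypoints
pad, gain = (4.0, 6.0), 2.0


def best_time(fn, repeat=5, number=200):
    # Best of `repeat` rounds, in seconds per call. Each call works on a fresh
    # clone, so the clone cost is included equally in both measurements.
    rounds = timeit.repeat(lambda: fn(base.clone(), pad, gain), repeat=repeat, number=number)
    return min(rounds) / number


t_aug = best_time(scale_augmented)
t_inp = best_time(scale_inplace)
print(f"augmented: {t_aug * 1e6:.1f} μs, in-place methods: {t_inp * 1e6:.1f} μs")
```

Taking the minimum over several rounds mirrors the "best of 88 runs" convention in the report and reduces noise from other processes.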

Test Case Benefits:
The optimizations are particularly effective for:

  • Large batch processing (500 detections with 20 keypoints: 9.4% faster)
  • High keypoint counts (100 keypoints per detection: 12.6% faster)
  • Standard pose estimation workloads (2-17 keypoints: 10-20% faster)

These micro-optimizations matter in YOLO validation pipelines because _prepare_pred is called repeatedly for every detection batch, so a saving of roughly 10-20 μs per call (as measured in the generated tests below) accumulates over model evaluation and training workflows.
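For reference, the gain and padding that scale_boxes derives when ratio_pad is None can be worked through in plain Python. This is a minimal sketch based on the formula in the test listing below, using the pad_w and pad_h names mentioned above; the helper name letterbox_gain_pad is hypothetical and the exact form in the PR may differ.

```python
def letterbox_gain_pad(img1_shape, img0_shape):
    """Return (gain, (pad_w, pad_h)) for shapes given as (height, width)."""
    # gain: how much the original image was scaled to fit the model input.
    gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])
    # Width and height padding computed separately, then packed once.
    pad_w = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)
    pad_h = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)
    return gain, (pad_w, pad_h)


print(letterbox_gain_pad((100, 200), (50, 100)))   # (2.0, (0, 0))
print(letterbox_gain_pad((640, 640), (480, 640)))  # (1.0, (0, 80))
```

In the second call the original image is wider than it is tall, so all of the padding falls on the height axis.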

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 81 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import torch
from ultralytics.models.yolo.pose.val import PoseValidator


def scale_boxes(img1_shape, boxes, img0_shape, ratio_pad=None, padding=True, xywh=False):
    # Rescale bounding boxes from img1_shape to img0_shape.
    if ratio_pad is None:
        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])
        pad = (
            round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1),
            round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1),
        )
    else:
        gain = ratio_pad[0][0]
        pad = ratio_pad[1]
    if padding:
        boxes[..., 0] -= pad[0]
        boxes[..., 1] -= pad[1]
        if not xywh:
            boxes[..., 2] -= pad[0]
            boxes[..., 3] -= pad[1]
    boxes[..., :4] /= gain
    # Clip to image boundaries
    boxes[..., 0] = boxes[..., 0].clamp(0, img0_shape[1])
    boxes[..., 1] = boxes[..., 1].clamp(0, img0_shape[0])
    boxes[..., 2] = boxes[..., 2].clamp(0, img0_shape[1])
    boxes[..., 3] = boxes[..., 3].clamp(0, img0_shape[0])
    return boxes


class DummySuper:
    # Simulate DetectionValidator._prepare_pred
    def _prepare_pred(self, pred, pbatch):
        predn = pred.clone()
        scale_boxes(pbatch["imgsz"], predn[:, :4], pbatch["ori_shape"], ratio_pad=pbatch["ratio_pad"])
        return predn


# ----------- UNIT TESTS ------------

# ----------- BASIC TEST CASES ------------


def test_basic_single_prediction():
    # Test with a single prediction, 2 keypoints
    pred = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.9, 1, 50.0, 60.0, 70.0, 80.0]])  # [x1,y1,x2,y2,conf,class,kpts...]
    pbatch = {"imgsz": (100, 200), "ori_shape": (50, 100), "ratio_pad": ((2.0,), (0, 0)), "kpts": torch.zeros(1, 2, 2)}
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)  # 132μs -> 118μs (11.5% faster)


def test_basic_batch_predictions():
    # Test with a batch of predictions, 3 predictions, 3 keypoints
    pred = torch.tensor(
        [
            [0.0, 0.0, 10.0, 10.0, 0.7, 0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
            [10.0, 20.0, 30.0, 40.0, 0.8, 1, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0],
            [20.0, 30.0, 40.0, 50.0, 0.9, 2, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0],
        ]
    )
    pbatch = {"imgsz": (200, 400), "ori_shape": (100, 200), "ratio_pad": ((2.0,), (0, 0)), "kpts": torch.zeros(1, 3, 3)}
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)  # 133μs -> 117μs (13.7% faster)


def test_basic_no_padding():
    # Test with ratio_pad=None (no explicit padding)
    pred = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.9, 1, 50.0, 60.0, 70.0, 80.0]])
    pbatch = {"imgsz": (100, 200), "ori_shape": (50, 100), "ratio_pad": None, "kpts": torch.zeros(1, 2, 2)}
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)  # 134μs -> 119μs (12.4% faster)


# ----------- EDGE TEST CASES ------------


def test_edge_empty_prediction():
    # Test with empty predictions
    pred = torch.empty((0, 10))
    pbatch = {"imgsz": (100, 200), "ori_shape": (50, 100), "ratio_pad": ((2.0,), (0, 0)), "kpts": torch.zeros(1, 2, 2)}
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)


def test_edge_single_keypoint():
    # Test with a single keypoint (nk=1)
    pred = torch.tensor([[1.0, 2.0, 3.0, 4.0, 0.5, 0, 5.0, 6.0]])
    pbatch = {"imgsz": (100, 100), "ori_shape": (50, 50), "ratio_pad": ((2.0,), (0, 0)), "kpts": torch.zeros(1, 1, 2)}
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)  # 136μs -> 121μs (11.9% faster)


def test_edge_large_padding():
    # Test with large padding values
    pred = torch.tensor([[100.0, 200.0, 300.0, 400.0, 0.9, 1, 500.0, 600.0, 700.0, 800.0]])
    pbatch = {
        "imgsz": (1000, 2000),
        "ori_shape": (500, 1000),
        "ratio_pad": ((2.0,), (100, 200)),
        "kpts": torch.zeros(1, 2, 2),
    }
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)  # 129μs -> 116μs (11.4% faster)


def test_edge_negative_coords():
    # Test with negative coordinates in prediction
    pred = torch.tensor([[-10.0, -20.0, -30.0, -40.0, 0.9, 1, -50.0, -60.0, -70.0, -80.0]])
    pbatch = {"imgsz": (100, 200), "ori_shape": (50, 100), "ratio_pad": ((2.0,), (0, 0)), "kpts": torch.zeros(1, 2, 2)}
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)  # 129μs -> 118μs (9.78% faster)


def test_edge_high_coords():
    # Test with coordinates exceeding image size
    pred = torch.tensor([[1000.0, 2000.0, 3000.0, 4000.0, 0.9, 1, 5000.0, 6000.0, 7000.0, 8000.0]])
    pbatch = {"imgsz": (100, 200), "ori_shape": (50, 100), "ratio_pad": ((2.0,), (0, 0)), "kpts": torch.zeros(1, 2, 2)}
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)  # 131μs -> 116μs (13.5% faster)


def test_edge_zero_gain():
    # Test with a zero-sized original image: computing the gain divides by zero
    pred = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.9, 1, 50.0, 60.0, 70.0, 80.0]])
    pbatch = {
        "imgsz": (100, 200),
        "ori_shape": (0, 0),  # imgsz / ori_shape raises ZeroDivisionError in scale_boxes above
        "ratio_pad": None,
        "kpts": torch.zeros(1, 2, 2),
    }
    pv = PoseValidator()
    try:
        predn, pred_kpts = pv._prepare_pred(pred, pbatch)
    except Exception:
        pass  # An exception (e.g. ZeroDivisionError) is acceptable here; the test only guards against a hard crash


# ----------- LARGE SCALE TEST CASES ------------


def test_large_batch():
    # Test with a large batch of predictions
    batch_size = 500
    nk = 17
    dim = 3
    pred = torch.cat(
        [
            torch.rand(batch_size, 6),  # [x1,y1,x2,y2,conf,class]
            torch.rand(batch_size, nk * dim),
        ],
        dim=1,
    )
    pbatch = {
        "imgsz": (640, 480),
        "ori_shape": (320, 240),
        "ratio_pad": ((2.0,), (0, 0)),
        "kpts": torch.zeros(1, nk, dim),
    }
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)  # 183μs -> 166μs (10.7% faster)


def test_large_keypoints():
    # Test with a large number of keypoints per prediction
    batch_size = 10
    nk = 100
    dim = 2
    pred = torch.cat([torch.rand(batch_size, 6), torch.rand(batch_size, nk * dim)], dim=1)
    pbatch = {
        "imgsz": (1280, 720),
        "ori_shape": (640, 360),
        "ratio_pad": ((2.0,), (0, 0)),
        "kpts": torch.zeros(1, nk, dim),
    }
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)  # 138μs -> 123μs (12.6% faster)


def test_large_scale_no_padding():
    # Large scale test with ratio_pad=None
    batch_size = 100
    nk = 17
    dim = 3
    pred = torch.cat([torch.rand(batch_size, 6), torch.rand(batch_size, nk * dim)], dim=1)
    pbatch = {"imgsz": (640, 480), "ori_shape": (320, 240), "ratio_pad": None, "kpts": torch.zeros(1, nk, dim)}
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)  # 144μs -> 131μs (9.85% faster)


def test_large_scale_extreme_values():
    # Large scale test with extreme values
    batch_size = 50
    nk = 10
    dim = 2
    pred = torch.cat([torch.full((batch_size, 6), 1e6), torch.full((batch_size, nk * dim), 1e6)], dim=1)
    pbatch = {
        "imgsz": (640, 480),
        "ori_shape": (320, 240),
        "ratio_pad": ((2.0,), (0, 0)),
        "kpts": torch.zeros(1, nk, dim),
    }
    pv = PoseValidator()
    predn, pred_kpts = pv._prepare_pred(pred, pbatch)  # 135μs -> 120μs (12.7% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest
import torch
from ultralytics.models.yolo.pose.val import PoseValidator


# --- Minimal mock implementation of ops module for testing ---
class OpsMock:
    @staticmethod
    def scale_boxes(img1_shape, boxes, img0_shape, ratio_pad=None, padding=True, xywh=False):
        # Simulate the actual function: scale boxes from img1_shape to img0_shape
        # For test, we just scale by ratio and pad if provided, otherwise do nothing
        # We will implement a simplified version that matches the contract for the test
        if ratio_pad is None:
            gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])
            pad = (
                round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1),
                round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1),
            )
        else:
            gain = ratio_pad[0][0]
            pad = ratio_pad[1]
        if padding:
            boxes[..., 0] -= pad[0]
            boxes[..., 1] -= pad[1]
            if not xywh:
                boxes[..., 2] -= pad[0]
                boxes[..., 3] -= pad[1]
        boxes[..., :4] /= gain
        # Clip to img0_shape
        boxes[..., 0] = boxes[..., 0].clamp(0, img0_shape[1])
        boxes[..., 1] = boxes[..., 1].clamp(0, img0_shape[0])
        boxes[..., 2] = boxes[..., 2].clamp(0, img0_shape[1])
        boxes[..., 3] = boxes[..., 3].clamp(0, img0_shape[0])
        return boxes

    @staticmethod
    def scale_coords(img1_shape, coords, img0_shape, ratio_pad=None, normalize=False, padding=True):
        # Simulate scaling of keypoints
        if ratio_pad is None:
            gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])
            pad = (
                (img1_shape[1] - img0_shape[1] * gain) / 2,
                (img1_shape[0] - img0_shape[0] * gain) / 2,
            )
        else:
            gain = ratio_pad[0][0]
            pad = ratio_pad[1]
        if padding:
            coords[..., 0] -= pad[0]
            coords[..., 1] -= pad[1]
        coords[..., 0] /= gain
        coords[..., 1] /= gain
        coords[..., 0] = coords[..., 0].clamp(0, img0_shape[1])
        coords[..., 1] = coords[..., 1].clamp(0, img0_shape[0])
        if normalize:
            coords[..., 0] /= img0_shape[1]
            coords[..., 1] /= img0_shape[0]
        return coords


ops = OpsMock

# --- Test suite for PoseValidator._prepare_pred ---


@pytest.fixture
def basic_pbatch():
    # Basic batch with 2 keypoints per detection, 1 detection
    return {
        "imgsz": (640, 480),
        "ori_shape": (320, 240),
        "ratio_pad": ((2.0, 2.0), (0, 0)),
        "kpts": torch.zeros((1, 2, 3)),  # 1 detection, 2 keypoints, 3 values (x, y, conf)
    }


def make_pred(num_det, num_kpts):
    # Each detection: [x1, y1, x2, y2, conf, class, kpt1x, kpt1y, kpt1c, kpt2x, kpt2y, kpt2c, ...]
    # For test, fill with increasing numbers
    box = torch.tensor([[10.0, 20.0, 30.0, 40.0, 0.9, 1.0] for _ in range(num_det)])
    kpts = torch.arange(num_det * num_kpts * 3, dtype=torch.float32).reshape(num_det, num_kpts * 3) + 50
    return torch.cat([box, kpts], dim=1)


# ----------------------
# 1. Basic Test Cases
# ----------------------


def test_basic_single_detection_two_kpts(basic_pbatch):
    # One detection, two keypoints
    pred = make_pred(1, 2)
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, basic_pbatch)  # 123μs -> 109μs (13.2% faster)


def test_basic_multiple_detections_multiple_kpts():
    # Three detections, four keypoints each
    num_det, num_kpts = 3, 4
    pbatch = {
        "imgsz": (640, 640),
        "ori_shape": (320, 320),
        "ratio_pad": ((2.0, 2.0), (0, 0)),
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = make_pred(num_det, num_kpts)
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)  # 123μs -> 106μs (15.7% faster)


def test_basic_no_padding():
    # Test with no padding and ratio_pad None
    num_det, num_kpts = 2, 3
    pbatch = {
        "imgsz": (320, 320),
        "ori_shape": (320, 320),
        "ratio_pad": None,
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = make_pred(num_det, num_kpts)
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)  # 125μs -> 110μs (13.8% faster)


# ----------------------
# 2. Edge Test Cases
# ----------------------


def test_zero_detections():
    # No detections
    num_det, num_kpts = 0, 5
    pbatch = {
        "imgsz": (640, 480),
        "ori_shape": (320, 240),
        "ratio_pad": ((2.0, 2.0), (0, 0)),
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = torch.zeros((0, 6 + num_kpts * 3))
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)


def test_zero_keypoints():
    # Detections with zero keypoints
    num_det, num_kpts = 2, 0
    pbatch = {
        "imgsz": (640, 480),
        "ori_shape": (320, 240),
        "ratio_pad": ((2.0, 2.0), (0, 0)),
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = torch.zeros((num_det, 6))  # No keypoints
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)


def test_large_coordinates():
    # Test with very large box and kpt coordinates
    num_det, num_kpts = 1, 2
    pbatch = {
        "imgsz": (10000, 10000),
        "ori_shape": (5000, 5000),
        "ratio_pad": ((2.0, 2.0), (0, 0)),
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = torch.cat(
        [
            torch.tensor([[10000.0, 20000.0, 30000.0, 40000.0, 0.9, 1.0]]),
            torch.tensor([[50000.0, 60000.0, 0.5, 70000.0, 80000.0, 0.8]]),
        ],
        dim=1,
    )
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)  # 135μs -> 119μs (12.8% faster)


def test_negative_coordinates():
    # Test with negative coordinates (should be clipped to 0)
    num_det, num_kpts = 1, 2
    pbatch = {
        "imgsz": (640, 480),
        "ori_shape": (320, 240),
        "ratio_pad": ((2.0, 2.0), (0, 0)),
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = torch.cat(
        [
            torch.tensor([[-10.0, -20.0, -30.0, -40.0, 0.9, 1.0]]),
            torch.tensor([[-50.0, -60.0, 0.5, -70.0, -80.0, 0.8]]),
        ],
        dim=1,
    )
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)  # 128μs -> 116μs (10.2% faster)


def test_non_square_images():
    # Test with non-square images and different aspect ratios
    num_det, num_kpts = 2, 3
    pbatch = {
        "imgsz": (800, 400),
        "ori_shape": (400, 200),
        "ratio_pad": ((2.0, 2.0), (10, 20)),
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = make_pred(num_det, num_kpts)
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)  # 121μs -> 106μs (14.3% faster)


def test_single_keypoint():
    # Test with only one keypoint per detection
    num_det, num_kpts = 2, 1
    pbatch = {
        "imgsz": (320, 320),
        "ori_shape": (320, 320),
        "ratio_pad": None,
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = make_pred(num_det, num_kpts)
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)  # 124μs -> 109μs (13.3% faster)


def test_float_and_int_types():
    # Test with float64 (double precision) input; the other tests cover float32
    num_det, num_kpts = 2, 2
    pbatch = {
        "imgsz": (640, 480),
        "ori_shape": (320, 240),
        "ratio_pad": ((2.0, 2.0), (0, 0)),
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = make_pred(num_det, num_kpts).double()
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)  # 116μs -> 104μs (12.4% faster)


# ----------------------
# 3. Large Scale Test Cases
# ----------------------


def test_large_batch_and_keypoints():
    # Test with 100 detections, 10 keypoints each (max 100*10*3*4 = 12KB)
    num_det, num_kpts = 100, 10
    pbatch = {
        "imgsz": (640, 480),
        "ori_shape": (320, 240),
        "ratio_pad": ((2.0, 2.0), (0, 0)),
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = make_pred(num_det, num_kpts)
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)  # 129μs -> 111μs (15.6% faster)


def test_maximum_allowed_tensor_size():
    # Test with large but <100MB tensor: 500 detections, 20 keypoints each (500*20*3*4 = 120KB)
    num_det, num_kpts = 500, 20
    pbatch = {
        "imgsz": (1280, 960),
        "ori_shape": (640, 480),
        "ratio_pad": ((2.0, 2.0), (0, 0)),
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = make_pred(num_det, num_kpts)
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)  # 174μs -> 159μs (9.41% faster)


def test_varying_keypoint_confidences():
    # Test with keypoint confidences varying from 0 to 1
    num_det, num_kpts = 10, 5
    pbatch = {
        "imgsz": (640, 480),
        "ori_shape": (320, 240),
        "ratio_pad": ((2.0, 2.0), (0, 0)),
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = make_pred(num_det, num_kpts)
    # Set confidences to linspace 0..1
    for i in range(num_kpts):
        pred[:, 6 + i * 3 + 2] = torch.linspace(0, 1, num_det)
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)  # 117μs -> 97.8μs (19.9% faster)


def test_extreme_ratio_pad_and_gain():
    # Test with extreme ratio_pad values (simulate heavy letterboxing)
    num_det, num_kpts = 3, 2
    pbatch = {
        "imgsz": (1000, 500),
        "ori_shape": (100, 50),
        "ratio_pad": ((10.0, 10.0), (100, 200)),
        "kpts": torch.zeros((num_det, num_kpts, 3)),
    }
    pred = make_pred(num_det, num_kpts)
    validator = PoseValidator()
    predn, pred_kpts = validator._prepare_pred(pred, pbatch)  # 121μs -> 104μs (15.3% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
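
The listings above call _prepare_pred without showing explicit assertions on the returned values. As a sanity check on what the box inputs should produce, the rescaling done by the mock scale_boxes can be traced by hand in plain Python. This is only a sketch: rescale_box is a hypothetical helper that mirrors the mock for a single xyxy box, not part of ultralytics.

```python
def rescale_box(box, gain, pad, ori_shape):
    """Undo letterbox padding and scaling for one xyxy box, then clip it."""
    x1, y1, x2, y2 = box
    pad_w, pad_h = pad
    height, width = ori_shape  # shapes are (height, width)
    scaled = [
        (x1 - pad_w) / gain,
        (y1 - pad_h) / gain,
        (x2 - pad_w) / gain,
        (y2 - pad_h) / gain,
    ]
    # x coordinates are clipped to the width, y coordinates to the height.
    limits = [width, height, width, height]
    return [min(max(v, 0.0), float(hi)) for v, hi in zip(scaled, limits)]


# Inputs taken from the basic, negative-coordinate, and high-coordinate tests.
print(rescale_box([10.0, 20.0, 30.0, 40.0], 2.0, (0, 0), (50, 100)))
print(rescale_box([-10.0, -20.0, -30.0, -40.0], 2.0, (0, 0), (50, 100)))
print(rescale_box([1000.0, 2000.0, 3000.0, 4000.0], 2.0, (0, 0), (50, 100)))
```

With a gain of 2.0 and no padding, the basic box is simply halved, the negative box clips to zero, and the oversized box clips to the original image bounds.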

To edit these changes git checkout codeflash/optimize-PoseValidator._prepare_pred-mirgg04h and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 4, 2025 13:11
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: Medium (Optimization Quality according to Codeflash) labels Dec 4, 2025