⚡️ Speed up function `_merge_tiles_elements` by 45% #786

codeflash-ai · 2025-12-03T22:54:11Z

📄 45% (0.45x) speedup for `_merge_tiles_elements` in `inference/core/utils/drawing.py`

⏱️ Runtime: 2.36 milliseconds → 1.63 milliseconds (best of 48 runs)

📝 Explanation and details

The optimized code achieves a 45% speedup through several key performance improvements:

**Main Optimizations:**

1. **Replaced `np.ones() * color` with `np.full()`**: The original code used `np.ones(shape) * color`, which creates an array of ones and then multiplies it by the color. The optimized version uses `np.full(shape, color, dtype=np.uint8)`, which creates the array directly with the desired values and dtype, eliminating the multiplication step.
2. **Eliminated `itertools` operations**: The original code used `itertools.chain.from_iterable(zip(row, [vertical_padding] * grid_size[1]))`, which builds several intermediate objects before the list is materialized. The optimized version interleaves tiles and padding with slice assignment (`row_with_paddings[::2] = row` and `row_with_paddings[1::2] = vertical_padding_row[:-1]`).
3. **Replaced `np.concatenate` with `np.vstack`**: Rows are now stacked along the first axis with `np.vstack`. Because `np.vstack` is itself implemented on top of `np.concatenate`, this change is unlikely to account for much of the measured speedup.
4. **Improved list construction**: Instead of repeatedly appending to lists in loops, the optimized version preallocates lists where possible and fills them by index.
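The first two changes can be sketched in isolation. This is a minimal illustration rather than the code from `drawing.py`: the tile size, margin, color, and variable names are assumptions picked for the example.

```python
import itertools

import numpy as np

# Hypothetical inputs, chosen only for illustration.
tile_h, tile_w, margin = 4, 4, 2
color = (255, 0, 0)
row = [np.zeros((tile_h, tile_w, 3), dtype=np.uint8) for _ in range(3)]

# Padding via multiplication: np.ones defaults to float64, so the product is float64.
padding_mul = np.ones((tile_h, margin, 3)) * color

# Padding via np.full: the color is broadcast over the last axis and stored as uint8.
padding_full = np.full((tile_h, margin, 3), color, dtype=np.uint8)

# Interleaving with itertools: zip pairs each tile with a padding, chain flattens
# the pairs, and the trailing padding is sliced off.
chained = list(
    itertools.chain.from_iterable(zip(row, [padding_full] * len(row)))
)[:-1]

# Interleaving with slice assignment into a preallocated list:
# even slots hold tiles, odd slots hold paddings.
interleaved = [None] * (2 * len(row) - 1)
interleaved[::2] = row
interleaved[1::2] = [padding_full] * (len(row) - 1)

merged = np.concatenate(interleaved, axis=1)
```

Both interleaving approaches produce the same sequence of arrays; the slice-assignment version skips the `zip` and `chain` iterators and the extra list slice.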

**Performance Impact by Test Cases:**

- **Large grids benefit**: The 20x20 grid tests show 42% and 84% speedups, while the 50x20 grid tests (2x2 tiles) show about 26%.
- **Large tiles benefit most**: The 100x100 tile test shows a 277% speedup, the largest in the suite, suggesting that the memory allocation improvements are substantial.
- **Even small cases improve**: Single-tile operations still gain 15-20%.
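One plausible reason for the large-tile result is dtype size. The sketch below assumes the padding was previously float64, as `np.ones(shape) * color` implies; the strip dimensions mirror the 100x100 tile test with `tile_margin=5`, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical padding strip: 100 rows high, 5 pixels wide, 3 channels.
shape = (100, 5, 3)
color = (200, 100, 50)

as_float = np.ones(shape) * color                 # float64: 8 bytes per value
as_uint8 = np.full(shape, color, dtype=np.uint8)  # uint8: 1 byte per value

print(as_float.nbytes, as_uint8.nbytes)

# Concatenating uint8 tiles with a float64 padding promotes the whole result
# to float64, so it would need a cast back to uint8 afterwards.
tile = np.zeros((100, 100, 3), dtype=np.uint8)
mixed = np.concatenate([tile, as_float, tile], axis=1)
uniform = np.concatenate([tile, as_uint8, tile], axis=1)

print(mixed.dtype, uniform.dtype)
```

With a float64 padding, every merged row is eight times larger than its uint8 equivalent, and that cost grows with tile area.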

**Function Usage Context:**
Based on the function reference, `_merge_tiles_elements` is called from `_generate_tiles`, which appears to be part of an image tiling/visualization pipeline. This suggests the function is used to create composite images from multiple smaller images, potentially in batch processing scenarios where the performance gains would compound across multiple calls.

The optimizations are particularly valuable for computer vision workflows where large numbers of images need to be arranged in grids with margins, as the improvements scale with both grid size and individual tile size.
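To make the grid-with-margins layout concrete, here is a small end-to-end sketch. `merge_tiles_sketch` is a hypothetical stand-in written for this example, not the function from `drawing.py`, and the `(width, height)` ordering of `tile_size` is an assumption.

```python
from typing import List, Tuple

import numpy as np


def merge_tiles_sketch(
    tiles: List[List[np.ndarray]],
    tile_size: Tuple[int, int],  # (width, height): an assumed convention
    margin: int,
    margin_color: Tuple[int, int, int],
) -> np.ndarray:
    """Hypothetical illustration: join a grid of tiles with colored margins."""
    _, height = tile_size
    v_pad = np.full((height, margin, 3), margin_color, dtype=np.uint8)

    rows = []
    for row in tiles:
        # Even slots hold tiles, odd slots hold vertical paddings.
        parts = [None] * (2 * len(row) - 1)
        parts[::2] = row
        parts[1::2] = [v_pad] * (len(row) - 1)
        rows.append(np.concatenate(parts, axis=1))

    # Horizontal padding spans the full width of a merged row.
    h_pad = np.full((margin, rows[0].shape[1], 3), margin_color, dtype=np.uint8)
    stacked = [None] * (2 * len(rows) - 1)
    stacked[::2] = rows
    stacked[1::2] = [h_pad] * (len(rows) - 1)
    return np.vstack(stacked).astype(np.uint8)


tile = np.full((5, 5, 3), 100, dtype=np.uint8)
grid = merge_tiles_sketch([[tile, tile], [tile, tile]], (5, 5), 2, (255, 0, 0))
print(grid.shape)
```

For a 2x2 grid of 5x5 tiles with a margin of 2, each axis is 5 + 2 + 5 pixels long, and margins appear only between tiles, not around the outer edge.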

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 30 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import itertools
from typing import List, Tuple

import numpy as np

# imports
import pytest
from inference.core.utils.drawing import _merge_tiles_elements

# unit tests

# --- Basic Test Cases ---


def test_single_tile_no_margin():
    # 1x1 grid, no margin
    tile = np.ones((10, 10, 3), dtype=np.uint8) * 50
    codeflash_output = _merge_tiles_elements([[tile]], (1, 1), (10, 10), 0, (0, 0, 0))
    result = codeflash_output  # 18.5μs -> 15.5μs (19.7% faster)


def test_2x2_grid_with_margin():
    # 2x2 grid, tile size 5x5, margin 2, margin color red
    tile = np.ones((5, 5, 3), dtype=np.uint8) * 100
    tiles = [[tile, tile], [tile, tile]]
    codeflash_output = _merge_tiles_elements(tiles, (2, 2), (5, 5), 2, (255, 0, 0))
    result = codeflash_output  # 24.4μs -> 17.7μs (37.6% faster)


def test_different_margin_color():
    # 2x1 grid, blue margin
    tile = np.ones((4, 4, 3), dtype=np.uint8) * 10
    tiles = [[tile], [tile]]
    codeflash_output = _merge_tiles_elements(tiles, (2, 1), (4, 4), 3, (0, 0, 255))
    result = codeflash_output  # 20.3μs -> 16.2μs (25.3% faster)


def test_zero_sized_tile():
    # 1x1 grid, tile size 0x0
    tile = np.zeros((0, 0, 3), dtype=np.uint8)
    codeflash_output = _merge_tiles_elements(
        [[tile]], (1, 1), (0, 0), 2, (123, 222, 111)
    )
    result = codeflash_output  # 35.8μs -> 25.8μs (39.0% faster)


def test_zero_margin():
    # 2x2 grid, margin 0
    tile = np.ones((3, 3, 3), dtype=np.uint8) * 42
    tiles = [[tile, tile], [tile, tile]]
    codeflash_output = _merge_tiles_elements(tiles, (2, 2), (3, 3), 0, (1, 1, 1))
    result = codeflash_output  # 23.9μs -> 19.6μs (21.9% faster)


def test_non_uint8_tiles():
    # Tiles in float, margin color float
    tile = np.ones((2, 2, 3), dtype=np.float32) * 0.5
    tiles = [[tile, tile]]
    codeflash_output = _merge_tiles_elements(tiles, (1, 2), (2, 2), 1, (0.2, 0.3, 0.4))
    result = codeflash_output  # 22.5μs -> 17.5μs (28.6% faster)


def test_margin_larger_than_tile():
    # Margin larger than tile size
    tile = np.ones((2, 2, 3), dtype=np.uint8) * 7
    tiles = [[tile, tile]]
    codeflash_output = _merge_tiles_elements(tiles, (1, 2), (2, 2), 5, (9, 8, 7))
    result = codeflash_output  # 20.9μs -> 15.5μs (35.2% faster)


def test_empty_tiles_elements():
    # Empty tiles_elements
    with pytest.raises(IndexError):
        _merge_tiles_elements(
            [], (0, 0), (0, 0), 1, (0, 0, 0)
        )  # 11.0μs -> 6.47μs (69.4% faster)


def test_inconsistent_tile_shapes():
    # Tiles of different shapes
    tile1 = np.ones((2, 2, 3), dtype=np.uint8)
    tile2 = np.ones((3, 2, 3), dtype=np.uint8)
    tiles = [[tile1, tile2]]
    # Should raise ValueError due to shape mismatch in np.concatenate
    with pytest.raises(ValueError):
        _merge_tiles_elements(
            tiles, (1, 2), (2, 2), 1, (1, 1, 1)
        )  # 16.2μs -> 9.01μs (79.8% faster)


def test_large_grid_and_tiles():
    # 20x20 grid, tile size 10x10, margin 2
    tile = np.ones((10, 10, 3), dtype=np.uint8) * 200
    tiles = [[tile for _ in range(20)] for _ in range(20)]
    codeflash_output = _merge_tiles_elements(tiles, (20, 20), (10, 10), 2, (0, 255, 0))
    result = codeflash_output  # 386μs -> 209μs (84.3% faster)


def test_large_margin_color():
    # Large margin color values (overflow check)
    tile = np.ones((5, 5, 3), dtype=np.uint8) * 10
    tiles = [[tile, tile], [tile, tile]]
    # margin color > 255, should wrap or clip to uint8
    codeflash_output = _merge_tiles_elements(tiles, (2, 2), (5, 5), 1, (300, 400, 500))
    result = codeflash_output  # 31.5μs -> 21.0μs (49.6% faster)


def test_maximum_grid_size():
    # 50x20 grid, tile size 2x2, margin 1
    tile = np.ones((2, 2, 3), dtype=np.uint8) * 123
    tiles = [[tile for _ in range(20)] for _ in range(50)]
    codeflash_output = _merge_tiles_elements(tiles, (50, 20), (2, 2), 1, (1, 1, 1))
    result = codeflash_output  # 471μs -> 372μs (26.5% faster)


def test_large_tiles_elements_with_random_values():
    # 10x10 grid, each tile random
    rng = np.random.default_rng(42)
    tiles = [
        [rng.integers(0, 256, (3, 3, 3), dtype=np.uint8) for _ in range(10)]
        for _ in range(10)
    ]
    codeflash_output = _merge_tiles_elements(tiles, (10, 10), (3, 3), 2, (10, 20, 30))
    result = codeflash_output  # 82.9μs -> 61.2μs (35.4% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import itertools
from typing import List, Tuple

import numpy as np

# imports
import pytest  # used for our unit tests
from inference.core.utils.drawing import _merge_tiles_elements

# unit tests


# Helper function to create a colored tile
def make_tile(height, width, color):
    arr = np.ones((height, width, 3), dtype=np.uint8)
    arr *= np.array(color, dtype=np.uint8)
    return arr


# --------------- BASIC TEST CASES ---------------


def test_single_tile_no_margin():
    # 1x1 grid, no margin
    tile = make_tile(10, 20, (100, 150, 200))
    codeflash_output = _merge_tiles_elements(
        [[tile]],
        grid_size=(1, 1),
        single_tile_size=(20, 10),
        tile_margin=0,
        tile_margin_color=(0, 0, 0),
    )
    result = codeflash_output  # 24.2μs -> 21.1μs (14.8% faster)


def test_two_by_two_grid_with_margin():
    # 2x2 grid, margin=2, margin color=(255,0,0)
    tile = make_tile(5, 5, (0, 255, 0))
    tiles_elements = [[tile, tile], [tile, tile]]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(2, 2),
        single_tile_size=(5, 5),
        tile_margin=2,
        tile_margin_color=(255, 0, 0),
    )
    result = codeflash_output  # 26.3μs -> 19.9μs (32.5% faster)


def test_non_square_grid():
    # 1x3 grid, margin=1, margin color=(0,0,255)
    tile = make_tile(4, 4, (50, 50, 50))
    tiles_elements = [[tile, tile, tile]]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(1, 3),
        single_tile_size=(4, 4),
        tile_margin=1,
        tile_margin_color=(0, 0, 255),
    )
    result = codeflash_output  # 22.4μs -> 16.4μs (36.1% faster)


def test_different_tile_colors():
    # 2x2 grid, each tile a different color
    tiles_elements = [
        [make_tile(2, 2, (10, 20, 30)), make_tile(2, 2, (40, 50, 60))],
        [make_tile(2, 2, (70, 80, 90)), make_tile(2, 2, (100, 110, 120))],
    ]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(2, 2),
        single_tile_size=(2, 2),
        tile_margin=1,
        tile_margin_color=(1, 2, 3),
    )
    result = codeflash_output  # 22.8μs -> 17.6μs (29.5% faster)


# --------------- EDGE TEST CASES ---------------


def test_zero_sized_tile():
    # Tile with zero height
    tile = make_tile(0, 5, (1, 2, 3))
    tiles_elements = [[tile]]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(1, 1),
        single_tile_size=(5, 0),
        tile_margin=2,
        tile_margin_color=(1, 2, 3),
    )
    result = codeflash_output  # 18.3μs -> 15.0μs (21.9% faster)


def test_zero_margin():
    # No margin between tiles
    tile = make_tile(2, 2, (7, 8, 9))
    tiles_elements = [[tile, tile], [tile, tile]]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(2, 2),
        single_tile_size=(2, 2),
        tile_margin=0,
        tile_margin_color=(0, 0, 0),
    )
    result = codeflash_output  # 21.4μs -> 17.4μs (23.2% faster)


def test_maximum_margin_color():
    # Margin color at max value
    tile = make_tile(1, 1, (0, 0, 0))
    tiles_elements = [[tile, tile]]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(1, 2),
        single_tile_size=(1, 1),
        tile_margin=1,
        tile_margin_color=(255, 255, 255),
    )
    result = codeflash_output  # 19.5μs -> 15.0μs (30.0% faster)


def test_non_uint8_tile_input():
    # Tiles with float type
    tile = np.ones((2, 2, 3), dtype=np.float32) * 123.4
    tiles_elements = [[tile, tile]]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(1, 2),
        single_tile_size=(2, 2),
        tile_margin=1,
        tile_margin_color=(200, 100, 50),
    )
    result = codeflash_output  # 21.3μs -> 16.4μs (30.0% faster)


def test_single_column_grid():
    # 3x1 grid, margin=1
    tile = make_tile(2, 2, (100, 100, 100))
    tiles_elements = [[tile], [tile], [tile]]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(3, 1),
        single_tile_size=(2, 2),
        tile_margin=1,
        tile_margin_color=(0, 255, 0),
    )
    result = codeflash_output  # 22.5μs -> 18.7μs (20.6% faster)


def test_single_row_grid():
    # 1x3 grid, margin=2
    tile = make_tile(2, 2, (100, 100, 100))
    tiles_elements = [[tile, tile, tile]]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(1, 3),
        single_tile_size=(2, 2),
        tile_margin=2,
        tile_margin_color=(0, 0, 0),
    )
    result = codeflash_output  # 19.8μs -> 14.8μs (33.8% faster)


def test_non_uniform_tile_size():
    # Tiles of different sizes (should fail)
    tile1 = make_tile(2, 2, (1, 2, 3))
    tile2 = make_tile(3, 2, (4, 5, 6))
    tiles_elements = [[tile1, tile2]]
    with pytest.raises(ValueError):
        _merge_tiles_elements(
            tiles_elements,
            grid_size=(1, 2),
            single_tile_size=(2, 2),  # single_tile_size does not match tile2
            tile_margin=1,
            tile_margin_color=(0, 0, 0),
        )  # 13.0μs -> 8.42μs (54.7% faster)


def test_negative_margin():
    # Negative margin should fail
    tile = make_tile(2, 2, (1, 1, 1))
    tiles_elements = [[tile, tile]]
    with pytest.raises(ValueError):
        _merge_tiles_elements(
            tiles_elements,
            grid_size=(1, 2),
            single_tile_size=(2, 2),
            tile_margin=-1,
            tile_margin_color=(0, 0, 0),
        )  # 2.43μs -> 2.65μs (8.20% slower)


def test_empty_tiles_elements():
    # Empty input list
    tiles_elements = []
    with pytest.raises(IndexError):
        _merge_tiles_elements(
            tiles_elements,
            grid_size=(0, 0),
            single_tile_size=(0, 0),
            tile_margin=1,
            tile_margin_color=(0, 0, 0),
        )  # 13.5μs -> 7.26μs (86.0% faster)


# --------------- LARGE SCALE TEST CASES ---------------


def test_large_grid_performance():
    # 20x20 grid with small tiles and margin
    tile = make_tile(5, 5, (123, 231, 132))
    tiles_elements = [[tile for _ in range(20)] for _ in range(20)]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(20, 20),
        single_tile_size=(5, 5),
        tile_margin=2,
        tile_margin_color=(10, 20, 30),
    )
    result = codeflash_output  # 257μs -> 181μs (42.3% faster)


def test_large_tile_size():
    # 2x2 grid, each tile 100x100
    tile = make_tile(100, 100, (50, 50, 50))
    tiles_elements = [[tile, tile], [tile, tile]]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(2, 2),
        single_tile_size=(100, 100),
        tile_margin=5,
        tile_margin_color=(200, 100, 50),
    )
    result = codeflash_output  # 163μs -> 43.3μs (277% faster)


def test_large_margin():
    # 3x3 grid, margin=20
    tile = make_tile(10, 10, (1, 2, 3))
    tiles_elements = [[tile for _ in range(3)] for _ in range(3)]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(3, 3),
        single_tile_size=(10, 10),
        tile_margin=20,
        tile_margin_color=(4, 5, 6),
    )
    result = codeflash_output  # 54.9μs -> 31.8μs (72.9% faster)


def test_maximum_grid_size():
    # 50x20 grid, 2x2 tiles, margin=1
    tile = make_tile(2, 2, (11, 22, 33))
    tiles_elements = [[tile for _ in range(20)] for _ in range(50)]
    codeflash_output = _merge_tiles_elements(
        tiles_elements,
        grid_size=(50, 20),
        single_tile_size=(2, 2),
        tile_margin=1,
        tile_margin_color=(44, 55, 66),
    )
    result = codeflash_output  # 472μs -> 374μs (26.3% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, `git checkout codeflash/optimize-_merge_tiles_elements-miqlt5zc` and push.


codeflash-ai bot requested a review from grzegorz-roboflow as a code owner December 3, 2025 22:54

codeflash-ai bot requested a review from mashraf-222 December 3, 2025 22:54

codeflash-ai bot added the `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: High` (Optimization Quality according to Codeflash) labels Dec 3, 2025
