@codeflash-ai codeflash-ai bot commented Dec 2, 2025

📄 49% (0.49x) speedup for inverse_permutation in xarray/core/nputils.py

⏱️ Runtime: 743 microseconds → 499 microseconds (best of 26 runs)
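
As a quick check on the headline figure, the sketch below recomputes the speedup from the two runtimes. It assumes the speedup is defined as original time divided by optimized time, minus one; that definition is an assumption, chosen because it reproduces the reported value.

```python
def speedup(original_us: float, optimized_us: float) -> float:
    """Relative speedup, assuming speedup = original / optimized - 1."""
    return original_us / optimized_us - 1.0


# Runtimes taken from the summary line above.
overall = speedup(743.0, 499.0)
print(f"{overall:.1%}")  # 48.9%, which rounds to the reported 49%
```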

📝 Explanation and details

The optimized code achieves a 49% speedup (743 μs → 499 μs across the benchmark runs) through three micro-optimizations that trim per-call overhead in array allocation and length lookup:

Key optimizations:

  1. Replaced np.full() with np.empty() + .fill(): The original code used np.full(N, -1, dtype=np.intp). np.full is a Python-level wrapper that allocates the array and then broadcasts the fill value into it with np.copyto. The optimized version calls np.empty() to allocate uninitialized memory and fills it in place with .fill(-1), which skips the wrapper and the broadcasting step. Both versions allocate a single result array, so the gain comes from lower call overhead rather than from saved memory.

  2. Used .size instead of len(): For NumPy arrays, .size is a direct attribute access while len() involves a function call. For the 1-D index arrays used here the two return the same value. This provides a marginal but consistent performance gain across all test cases.

  3. Consistent use of indices.size: indices.size replaces len(indices) in both places where the array length is needed, so the length is read the same way throughout while capturing the performance benefit.
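
The three changes can be sketched side by side as follows. This is a reconstruction from the description above, not a copy of the diff: the function names `inverse_permutation_original` and `inverse_permutation_optimized` and the local variable names are hypothetical.

```python
import numpy as np


def inverse_permutation_original(indices, N=None):
    """Baseline as described above: np.full plus len()."""
    if N is None:
        N = len(indices)
    inverse = np.full(N, -1, dtype=np.intp)
    inverse[indices] = np.arange(len(indices), dtype=np.intp)
    return inverse


def inverse_permutation_optimized(indices, N=None):
    """Variant as described above: np.empty + .fill and .size."""
    if N is None:
        N = indices.size
    inverse = np.empty(N, dtype=np.intp)  # uninitialized memory
    inverse.fill(-1)                      # in-place fill, -1 marks "not covered"
    inverse[indices] = np.arange(indices.size, dtype=np.intp)
    return inverse


perm = np.array([2, 0, 3, 1])
print(inverse_permutation_optimized(perm))  # [1 3 0 2]
```

Both variants should return identical arrays for any input the baseline accepts, since only the allocation and length lookup differ.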

Why these optimizations work:

  • np.empty() + .fill() writes the fill value directly into the new array and avoids the wrapper and broadcasting overhead of np.full()
  • Direct attribute access (.size) is marginally cheaper than a call to len() in Python
  • The core algorithmic logic remains unchanged, preserving correctness and safety
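
To check the allocation claim on a given machine, a minimal micro-benchmark along these lines could be used. The helper names `alloc_full` and `alloc_empty_fill` and the repeat count are illustrative assumptions; absolute timings depend on the NumPy version and hardware, so no expected numbers are shown.

```python
import timeit

import numpy as np

N = 1000  # array length used by the large generated tests


def alloc_full():
    return np.full(N, -1, dtype=np.intp)


def alloc_empty_fill():
    out = np.empty(N, dtype=np.intp)
    out.fill(-1)
    return out


# Both strategies must produce identical arrays.
same = np.array_equal(alloc_full(), alloc_empty_fill())

repeats = 20_000
t_full = timeit.timeit(alloc_full, number=repeats) / repeats
t_fill = timeit.timeit(alloc_empty_fill, number=repeats) / repeats
print(f"np.full:         {t_full * 1e6:.2f} us per call")
print(f"np.empty + fill: {t_fill * 1e6:.2f} us per call")
```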

Performance impact:
Based on the function references, inverse_permutation is called in data indexing and concatenation operations within xarray's groupby and index management systems. These are potentially hot paths during data manipulation workflows. Speedups range from about 32% to 64% across the generated tests, and the absolute saving is typically 3 to 5 μs per call for both 4-element and 1000-element inputs. This suggests the gain comes from fixed per-call overhead, so it matters most when the function is called many times on small to moderately sized arrays.
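
For readers unfamiliar with why an inverse permutation appears in such paths, the sketch below shows the general pattern of reordering data and then restoring the original order. It is an illustration only, not xarray's groupby or concat code, and the local `inverse_permutation` is a stand-in written from the behavior the tests describe.

```python
import numpy as np


def inverse_permutation(indices, N=None):
    """Local stand-in (assumed behavior), not imported from xarray."""
    if N is None:
        N = indices.size
    inverse = np.full(N, -1, dtype=np.intp)
    inverse[indices] = np.arange(indices.size, dtype=np.intp)
    return inverse


values = np.array([30, 10, 40, 20])
order = np.argsort(values)            # positions that sort the data
sorted_values = values[order]         # data in sorted (e.g. grouped) order

restore = inverse_permutation(order)  # maps sorted positions back
restored = sorted_values[restore]     # original order recovered
print(restored)  # [30 10 40 20]
```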

Test case performance:
The optimization performs particularly well on:

  • Large arrays (35-57% speedup on 1000-element tests)
  • Edge cases like empty arrays (50-58% speedup)
  • Simple permutations (46-55% speedup)
  • Even error cases see 32-38% improvement due to reduced setup overhead

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 46 Passed |
| ⏪ Replay Tests | 82 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import numpy as np

# imports
import pytest  # used for our unit tests
from xarray.core.nputils import inverse_permutation

# unit tests

# ------------------ BASIC TEST CASES ------------------


def test_identity_permutation():
    # Identity permutation: [0, 1, 2, 3]
    indices = np.array([0, 1, 2, 3])
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 12.6μs -> 8.51μs (48.6% faster)


def test_simple_permutation():
    # Permutation: [2, 0, 3, 1]
    indices = np.array([2, 0, 3, 1])
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 12.2μs -> 8.18μs (49.2% faster)


def test_reverse_permutation():
    # Permutation: [3, 2, 1, 0]
    indices = np.array([3, 2, 1, 0])
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 12.1μs -> 7.85μs (54.6% faster)


def test_nonzero_start():
    # Permutation: [1, 2, 0]
    indices = np.array([1, 2, 0])
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 11.7μs -> 8.00μs (46.7% faster)


def test_single_element():
    # Single element: [0]
    indices = np.array([0])
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 11.6μs -> 7.55μs (53.7% faster)


# ------------------ EDGE TEST CASES ------------------


def test_empty_array():
    # Empty array should return empty array
    indices = np.array([], dtype=int)
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 11.5μs -> 7.38μs (55.3% faster)


def test_partial_permutation_with_N():
    # indices does not cover all positions, N > len(indices)
    indices = np.array([1, 3])
    N = 5
    codeflash_output = inverse_permutation(indices, N)
    result = codeflash_output  # 11.3μs -> 7.58μs (48.9% faster)
    # Positions not covered should be -1
    expected = np.array([-1, 0, -1, 1, -1])


def test_indices_with_duplicates():
    # Duplicates in indices: [1, 1, 2]
    indices = np.array([1, 1, 2])
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 12.7μs -> 8.62μs (47.1% faster)
    # Last occurrence should be used: result[1] == 1
    expected = np.array([-1, 1, 2])


def test_indices_with_negative_numbers():
    # Negative indices: [-1, 0, 1]
    indices = np.array([-1, 0, 1])
    N = 3
    codeflash_output = inverse_permutation(indices, N)
    result = codeflash_output  # 11.7μs -> 7.94μs (47.3% faster)
    # Index -1 wraps to the last position, so result[2] == 0
    expected = np.array([1, 2, 0])


def test_indices_with_large_N():
    # indices smaller than N, with some positions missing
    indices = np.array([0, 2])
    N = 4
    codeflash_output = inverse_permutation(indices, N)
    result = codeflash_output  # 11.6μs -> 7.72μs (50.5% faster)
    expected = np.array([0, -1, 1, -1])


def test_indices_dtype_int64():
    # indices with dtype int64
    indices = np.array([2, 0, 1], dtype=np.int64)
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 11.3μs -> 7.36μs (53.1% faster)
    expected = np.array([1, 2, 0])


def test_indices_dtype_int32():
    # indices with dtype int32
    indices = np.array([1, 0, 2], dtype=np.int32)
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 19.3μs -> 14.5μs (33.4% faster)
    expected = np.array([1, 0, 2])


def test_indices_dtype_intp():
    # indices with dtype intp
    indices = np.array([2, 1, 0], dtype=np.intp)
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 11.0μs -> 7.09μs (55.1% faster)
    expected = np.array([2, 1, 0])


def test_N_smaller_than_indices_length():
    # N=2 makes index 2 out of bounds, so the assignment raises IndexError
    indices = np.array([0, 1, 2])
    N = 2
    with pytest.raises(IndexError):
        inverse_permutation(indices, N)  # 13.5μs -> 10.2μs (32.5% faster)


def test_indices_out_of_bounds():
    # indices contains out-of-bounds value, should raise IndexError
    indices = np.array([0, 1, 5])
    N = 3
    with pytest.raises(IndexError):
        inverse_permutation(indices, N)  # 13.8μs -> 10.2μs (35.6% faster)


def test_indices_empty_with_N():
    # indices empty but N > 0
    indices = np.array([], dtype=int)
    N = 3
    codeflash_output = inverse_permutation(indices, N)
    result = codeflash_output  # 11.7μs -> 7.74μs (51.5% faster)
    expected = np.array([-1, -1, -1])


# ------------------ LARGE SCALE TEST CASES ------------------


def test_large_identity_permutation():
    # Large identity permutation
    N = 1000
    indices = np.arange(N)
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 13.3μs -> 9.41μs (41.4% faster)


def test_large_reverse_permutation():
    # Large reverse permutation
    N = 1000
    indices = np.arange(N - 1, -1, -1)
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 13.1μs -> 9.12μs (44.1% faster)
    expected = np.arange(N - 1, -1, -1)


def test_large_random_permutation():
    # Large random permutation
    N = 1000
    rng = np.random.default_rng(42)
    indices = rng.permutation(N)
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 9.98μs -> 6.68μs (49.5% faster)


def test_large_partial_permutation_with_N():
    # Large partial permutation, indices covers only half of N
    N = 1000
    indices = np.arange(0, N, 2)  # 0, 2, 4, ..., 998
    codeflash_output = inverse_permutation(indices, N)
    result = codeflash_output  # 12.4μs -> 7.92μs (57.2% faster)
    # Odd positions should be -1, even positions should be their index//2
    expected = np.full(N, -1, dtype=np.intp)
    expected[indices] = np.arange(len(indices), dtype=np.intp)


def test_large_duplicates():
    # Large array with duplicates
    N = 1000
    indices = np.arange(N)
    indices[500:] = 0  # duplicate 0 for half the array
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 10.3μs -> 7.46μs (38.1% faster)
    # Last occurrence should be used: result[0] == N-1
    expected = np.full(N, -1, dtype=np.intp)
    expected[0] = N - 1
    expected[1:500] = np.arange(1, 500)


def test_large_N_with_empty_indices():
    # Large N with empty indices
    N = 1000
    indices = np.array([], dtype=int)
    codeflash_output = inverse_permutation(indices, N)
    result = codeflash_output  # 12.2μs -> 8.13μs (49.8% faster)
    expected = np.full(N, -1, dtype=np.intp)


# ------------------ PROPERTY-BASED TESTS ------------------


@pytest.mark.parametrize(
    "indices",
    [
        np.array([0, 1, 2, 3]),
        np.array([3, 2, 1, 0]),
        np.array([2, 0, 3, 1]),
        np.array([1, 2, 0]),
    ],
)
def test_inverse_property(indices):
    # For any valid permutation, result[indices] == np.arange(len(indices))
    codeflash_output = inverse_permutation(indices)
    result = codeflash_output  # 58.0μs -> 39.2μs (47.9% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
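
The listing above computes `expected` values but contains no assertions, because the harness compares `codeflash_output` between the original and optimized runs. A minimal sketch of how the same checks could be asserted directly is shown here; the helper names `check_inverse_property` and `check_partial_permutation` are hypothetical, and the local `inverse_permutation` is a stand-in written from the behavior the tests describe.

```python
import numpy as np


def inverse_permutation(indices, N=None):
    """Local stand-in (assumed behavior), not imported from xarray."""
    if N is None:
        N = indices.size
    inverse = np.full(N, -1, dtype=np.intp)
    inverse[indices] = np.arange(indices.size, dtype=np.intp)
    return inverse


def check_inverse_property(indices):
    # For a valid permutation, result[indices] is 0, 1, ..., n-1.
    result = inverse_permutation(indices)
    np.testing.assert_array_equal(result[indices], np.arange(indices.size))


def check_partial_permutation():
    # Positions not covered by indices keep the -1 sentinel.
    result = inverse_permutation(np.array([1, 3]), N=5)
    np.testing.assert_array_equal(result, [-1, 0, -1, 1, -1])


for perm in ([0, 1, 2, 3], [3, 2, 1, 0], [2, 0, 3, 1], [1, 2, 0]):
    check_inverse_property(np.array(perm))
check_partial_permutation()
```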
import numpy as np

# imports
import pytest
from xarray.core.nputils import inverse_permutation

# unit tests

# 1. Basic Test Cases


def test_identity_permutation():
    # Identity permutation should return itself as the inverse
    arr = np.array([0, 1, 2, 3, 4])
    codeflash_output = inverse_permutation(arr)
    inv = codeflash_output  # 12.7μs -> 8.48μs (49.3% faster)


def test_simple_permutation():
    # Simple permutation
    arr = np.array([2, 0, 1])
    codeflash_output = inverse_permutation(arr)
    inv = codeflash_output  # 12.2μs -> 8.17μs (49.1% faster)


def test_reverse_permutation():
    # Reversing a permutation
    arr = np.array([4, 3, 2, 1, 0])
    codeflash_output = inverse_permutation(arr)
    inv = codeflash_output  # 12.1μs -> 7.87μs (54.1% faster)


def test_nonzero_start_indices():
    # Indices not starting at zero
    arr = np.array([3, 4, 2])
    codeflash_output = inverse_permutation(arr, N=5)
    inv = codeflash_output  # 11.9μs -> 8.46μs (40.8% faster)
    # Should fill missing with -1 at positions 0 and 1
    expected = np.array([-1, -1, 2, 0, 1])


def test_permutation_with_N_greater_than_len_indices():
    # N greater than len(indices)
    arr = np.array([1, 3])
    codeflash_output = inverse_permutation(arr, N=4)
    inv = codeflash_output  # 11.7μs -> 8.05μs (45.8% faster)
    # Should fill missing with -1 at positions 0 and 2
    expected = np.array([-1, 0, -1, 1])


# 2. Edge Test Cases


def test_empty_indices():
    # Empty indices array
    arr = np.array([], dtype=int)
    codeflash_output = inverse_permutation(arr)
    inv = codeflash_output  # 11.5μs -> 7.63μs (51.1% faster)


def test_single_element():
    # Single element permutation
    arr = np.array([0])
    codeflash_output = inverse_permutation(arr)
    inv = codeflash_output  # 11.4μs -> 7.72μs (47.0% faster)


def test_indices_with_gaps():
    # Indices with gaps, N specified
    arr = np.array([2, 5])
    codeflash_output = inverse_permutation(arr, N=6)
    inv = codeflash_output  # 11.5μs -> 7.93μs (44.9% faster)
    # Should fill missing with -1 at positions 0,1,3,4
    expected = np.array([-1, -1, 0, -1, -1, 1])


def test_indices_with_duplicates():
    # Duplicates in indices (invalid permutation, but function doesn't check)
    arr = np.array([1, 1, 2])
    codeflash_output = inverse_permutation(arr)
    inv = codeflash_output  # 11.7μs -> 8.02μs (45.9% faster)
    # The last occurrence should win: [1,1,2] -> inv[1]=1, inv[2]=2, inv[0]=-1
    expected = np.array([-1, 1, 2])


def test_indices_out_of_bounds():
    # Indices out of bounds should raise IndexError
    arr = np.array([0, 2, 5])
    with pytest.raises(IndexError):
        inverse_permutation(arr, N=3)  # 14.3μs -> 10.4μs (37.3% faster)


def test_negative_indices():
    # Negative indices should fill from the end (numpy behavior)
    arr = np.array([0, -1])
    codeflash_output = inverse_permutation(arr, N=3)
    inv = codeflash_output  # 11.5μs -> 8.07μs (42.4% faster)
    # arr[0]=0, arr[1]=-1 -> inv[0]=0, inv[2]=1, inv[1]=-1
    expected = np.array([0, -1, 1])


def test_N_zero():
    # N=0 should return empty array
    arr = np.array([], dtype=int)
    codeflash_output = inverse_permutation(arr, N=0)
    inv = codeflash_output  # 11.9μs -> 7.52μs (57.9% faster)


def test_N_less_than_indices_length():
    # N=2 makes index 2 out of bounds, so the assignment raises IndexError
    arr = np.array([0, 1, 2])
    with pytest.raises(IndexError):
        inverse_permutation(arr, N=2)  # 14.0μs -> 10.1μs (37.8% faster)


def test_indices_dtype_int64_and_int32():
    # Should work with both int32 and int64 dtypes
    arr32 = np.array([2, 0, 1], dtype=np.int32)
    arr64 = np.array([2, 0, 1], dtype=np.int64)
    codeflash_output = inverse_permutation(arr32)
    inv32 = codeflash_output  # 18.9μs -> 14.2μs (32.7% faster)
    codeflash_output = inverse_permutation(arr64)
    inv64 = codeflash_output  # 2.98μs -> 1.82μs (63.8% faster)
    expected = np.array([1, 2, 0])


# 3. Large Scale Test Cases


def test_large_identity_permutation():
    # Large identity permutation
    N = 1000
    arr = np.arange(N)
    codeflash_output = inverse_permutation(arr)
    inv = codeflash_output  # 13.2μs -> 9.31μs (41.9% faster)


def test_large_reverse_permutation():
    # Large reverse permutation
    N = 1000
    arr = np.arange(N - 1, -1, -1)
    codeflash_output = inverse_permutation(arr)
    inv = codeflash_output  # 13.5μs -> 9.70μs (39.4% faster)


def test_large_random_permutation():
    # Large random permutation
    N = 1000
    rng = np.random.default_rng(42)
    arr = rng.permutation(N)
    codeflash_output = inverse_permutation(arr)
    inv = codeflash_output  # 10.1μs -> 6.86μs (46.6% faster)
    # arr[inv] should be np.arange(N)
    result = arr[inv]


def test_large_permutation_with_gaps():
    # Large permutation with gaps, N > len(indices)
    N = 1000
    arr = np.arange(0, N, 2)  # Only even indices
    codeflash_output = inverse_permutation(arr, N=N)
    inv = codeflash_output  # 12.9μs -> 8.84μs (45.6% faster)
    # Odd positions should be -1, even positions should be their order in arr
    expected = np.full(N, -1, dtype=np.intp)
    expected[arr] = np.arange(len(arr))


def test_large_permutation_with_duplicates():
    # Large permutation with duplicates (invalid, but test behavior)
    N = 1000
    arr = np.concatenate([np.arange(N // 2), np.arange(N // 2)])
    codeflash_output = inverse_permutation(arr)
    inv = codeflash_output  # 10.7μs -> 7.91μs (35.0% faster)
    # The last occurrence should win
    expected = np.arange(N // 2, N)
    result = inv[: N // 2]


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
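
Two of the edge cases exercised above follow directly from NumPy indexing rules: negative indices wrap around, and out-of-range indices raise `IndexError`. The sketch below demonstrates both with a local stand-in for `inverse_permutation` (assumed behavior, not imported from xarray).

```python
import numpy as np


def inverse_permutation(indices, N=None):
    """Local stand-in (assumed behavior), not imported from xarray."""
    if N is None:
        N = indices.size
    inverse = np.full(N, -1, dtype=np.intp)
    inverse[indices] = np.arange(indices.size, dtype=np.intp)
    return inverse


# A negative index wraps to the end: -1 refers to position 2 when N=3.
wrapped = inverse_permutation(np.array([0, -1]), N=3)
print(wrapped)  # [ 0 -1  1]

# An index outside [-N, N) cannot be assigned and raises IndexError.
try:
    inverse_permutation(np.array([0, 1, 5]), N=3)
    raised = False
except IndexError:
    raised = True
print(raised)  # True
```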
⏪ Replay Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| test_pytest_xarrayteststest_concat_py_xarrayteststest_computation_py_xarrayteststest_formatting_py_xarray__replay_test_0.py::test_xarray_core_nputils_inverse_permutation | 172μs | 107μs | 61.5%✅ |

To edit these changes, run `git checkout codeflash/optimize-inverse_permutation-mio2rvgo` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 2, 2025 04:25
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 2, 2025