@codeflash-ai codeflash-ai bot commented Dec 5, 2025

📄 40% (0.40x) speedup for _sample_negative_indices in src/transformers/models/wav2vec2/modeling_wav2vec2.py

⏱️ Runtime: 3.32 milliseconds → 2.37 milliseconds (best of 90 runs)
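
The headline percentage appears to be measured relative to the optimized runtime rather than the original one. A minimal sketch of that arithmetic follows; `speedup` is a hypothetical helper written for illustration and is not part of Codeflash:

```python
def speedup(original: float, optimized: float) -> float:
    """Relative gain, assuming it is measured against the optimized runtime."""
    return (original - optimized) / optimized


# Runtimes in milliseconds, taken from the summary line above.
print(f"{speedup(3.32, 2.37):.0%}")  # prints "40%"
```

Under this definition a change that halves the runtime counts as 100% faster, which is how individual tests can report gains above 100%.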

📝 Explanation and details

The optimization achieves a 40% speedup through several improvements that reduce computational overhead and redundant operations:

Key Optimizations:

  1. Conditional restructuring: The original code normalizes the mask with a conditional (ternary) expression; the optimized version uses an explicit if/else. Both forms run once per call, so the gain here is mostly clarity rather than speed.

  2. Early exit optimization: Added a masked_count == 0 check that skips the expensive sampling logic when no positions are masked. This provides dramatic speedups (167-282% faster) for edge cases with empty masks, as shown in the test results.

  3. Local variable caching: Stores mask_time_indices[batch_idx] in mask_row and mask_row.sum() in masked_count to avoid repeated array indexing and sum computations within the loop.

  4. Vectorized arithmetic: Replaces sampled_indices[sampled_indices >= feature_indices] += 1 with np.add(sampled_indices, sampled_indices >= feature_indices, out=sampled_indices). The comparison still allocates a boolean array, but adding it in place through a ufunc avoids the gather and scatter passes of boolean-mask indexing.

  5. Removed unnecessary broadcasting: Drops the np.broadcast_to() call and uses np.arange(high + 1)[:, None] directly. The comparison broadcasts the column against sampled_indices on its own, so the explicit call added overhead without changing the result.
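
The five changes above can be sketched together as follows. This is a reconstruction from the description, not the PR's actual diff: the function name `sample_negative_indices_sketch` is hypothetical, and the handling of the per-batch offset in the early-exit branch is an assumption made so that empty rows keep the same values as rows that go through the full loop body.

```python
import numpy as np


def sample_negative_indices_sketch(features_shape, num_negatives, mask_time_indices=None):
    """Hypothetical reconstruction of the optimized sampler (not the PR's diff)."""
    batch_size, sequence_length = features_shape
    sequence_length_range = np.arange(sequence_length)
    sampled_negative_indices = np.zeros(
        (batch_size, sequence_length, num_negatives), dtype=np.int32
    )

    # (1) Explicit branching instead of a conditional expression.
    if mask_time_indices is None:
        mask_time_indices = np.ones(features_shape, dtype=bool)
    else:
        mask_time_indices = mask_time_indices.astype(bool)

    for batch_idx in range(batch_size):
        # (3) Cache the mask row and its count once per iteration.
        mask_row = mask_time_indices[batch_idx]
        masked_count = int(mask_row.sum())

        # (2) Early exit: nothing to sample. Assumption: the per-batch
        # offset is still applied so the output matches the full path.
        if masked_count == 0:
            sampled_negative_indices[batch_idx] += batch_idx * sequence_length
            continue

        high = masked_count - 1
        mapped_masked_indices = sequence_length_range[mask_row]

        # (5) A (high + 1, 1) column broadcasts against (high + 1, num_negatives).
        feature_indices = np.arange(high + 1)[:, None]
        sampled_indices = np.random.randint(0, high, size=(high + 1, num_negatives))

        # (4) Shift each draw that is >= its own row index up by one, in place,
        # so a position never samples itself and the draw stays uniform.
        np.add(sampled_indices, sampled_indices >= feature_indices, out=sampled_indices)

        # Remap to real time steps, then offset into the flattened batch.
        sampled_negative_indices[batch_idx][mask_row] = mapped_masked_indices[sampled_indices]
        sampled_negative_indices[batch_idx] += batch_idx * sequence_length

    return sampled_negative_indices
```

As in the tests below, a row with exactly one masked position still raises `ValueError`, because `np.random.randint(0, 0, ...)` has an empty range.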

Performance Impact:

  • Best gains on edge cases with empty masks (167-282% faster)
  • Consistent improvements across the tests that reach the sampling loop (14-45% faster); the invalid-shape test, which fails before the loop, is unchanged within noise
  • np.random.randint() calls and the vectorized array operations remain the dominant cost; the savings come from trimming the indexing and allocation work around them
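
One way to check claims like these locally is a small micro-benchmark. The sketch below compares the two forms of the index shift on synthetic data; the names `shift_masked` and `shift_add` and the array sizes are assumptions chosen for illustration, and timings will vary by machine and NumPy version.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
high, num_negatives = 99, 10  # illustrative sizes, not taken from the PR
feature_indices = np.arange(high + 1)[:, None]
draws = rng.integers(0, high, size=(high + 1, num_negatives))


def shift_masked(values):
    """Original form: boolean-mask indexing with an in-place increment."""
    out = values.copy()
    out[out >= feature_indices] += 1
    return out


def shift_add(values):
    """Optimized form: add the boolean comparison in place with np.add."""
    out = values.copy()
    np.add(out, out >= feature_indices, out=out)
    return out


# Both forms must agree before their timings are worth comparing.
assert np.array_equal(shift_masked(draws), shift_add(draws))

runs = 2_000
for fn in (shift_masked, shift_add):
    seconds = timeit.timeit(lambda: fn(draws), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e6:.2f} µs per call")
```

Each function copies its input first so that repeated timing runs start from the same data; that copy is included in both timings equally.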

The optimizations are particularly effective because they target the hot path of negative sampling in contrastive learning, where this function is called repeatedly during model training.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 40 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import numpy as np

# imports
import pytest

from transformers.models.wav2vec2.modeling_wav2vec2 import _sample_negative_indices


# unit tests

# ------------------- BASIC TEST CASES -------------------


def test_basic_shape_and_type():
    # Basic: 2 batches, 4 time steps, 3 negatives, no mask
    features_shape = (2, 4)
    num_negatives = 3
    codeflash_output = _sample_negative_indices(features_shape, num_negatives)
    result = codeflash_output  # 112μs -> 93.2μs (20.6% faster)


def test_basic_no_mask_all_indices_in_range():
    # Basic: 1 batch, 5 time steps, 2 negatives, no mask
    features_shape = (1, 5)
    num_negatives = 2
    codeflash_output = _sample_negative_indices(features_shape, num_negatives)
    result = codeflash_output  # 65.5μs -> 55.1μs (18.9% faster)


def test_basic_mask_some_timesteps():
    # Basic: 1 batch, 6 time steps, 2 negatives, mask some
    features_shape = (1, 6)
    num_negatives = 2
    mask = np.array([[1, 0, 1, 0, 1, 0]], dtype=bool)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 58.8μs -> 48.0μs (22.4% faster)
    # Only masked positions should be filled; others should remain zero
    for t in range(6):
        if mask[0, t]:
            pass
        else:
            pass


def test_basic_multiple_batches_and_mask():
    # 2 batches, 5 time steps, 2 negatives, mask different for each batch
    features_shape = (2, 5)
    num_negatives = 2
    mask = np.array([[1, 1, 0, 1, 0], [0, 1, 1, 1, 0]], dtype=bool)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 83.6μs -> 64.4μs (29.8% faster)
    # For each batch, only masked positions should be filled
    for b in range(2):
        for t in range(5):
            if mask[b, t]:
                pass
            else:
                pass


# ------------------- EDGE TEST CASES -------------------


def test_edge_all_masked():
    # Every position is masked
    features_shape = (1, 4)
    num_negatives = 2
    mask = np.ones((1, 4), dtype=bool)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 55.7μs -> 45.3μs (23.0% faster)


def test_edge_none_masked():
    # No positions are masked: should return all zeros (since nothing to sample)
    features_shape = (1, 5)
    num_negatives = 2
    mask = np.zeros((1, 5), dtype=bool)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 52.7μs -> 19.7μs (167% faster)


def test_edge_one_time_step():
    # Only one time step: should not sample itself, but only one choice
    features_shape = (1, 1)
    num_negatives = 1
    with pytest.raises(ValueError):
        # Should raise error because randint(0,0) is invalid
        _sample_negative_indices(features_shape, num_negatives)  # 57.3μs -> 48.0μs (19.5% faster)


def test_edge_num_negatives_zero():
    # Zero negatives requested
    features_shape = (2, 3)
    num_negatives = 0
    codeflash_output = _sample_negative_indices(features_shape, num_negatives)
    result = codeflash_output  # 79.6μs -> 69.5μs (14.4% faster)


def test_edge_num_negatives_equals_timesteps():
    # num_negatives == sequence_length-1, should sample all possible negatives
    features_shape = (1, 4)
    num_negatives = 3
    codeflash_output = _sample_negative_indices(features_shape, num_negatives)
    result = codeflash_output  # 63.5μs -> 53.5μs (18.8% faster)
    # Each negative index should be in [0,3] and != its own position
    for t in range(4):
        for n in range(3):
            pass


def test_edge_mask_all_false():
    # Mask is all False, nothing to sample; each batch row holds only its offset (batch_idx * sequence_length)
    features_shape = (2, 3)
    num_negatives = 2
    mask = np.zeros((2, 3), dtype=bool)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 70.7μs -> 22.3μs (217% faster)


def test_edge_large_num_negatives():
    # num_negatives is very large, but less than sequence_length
    features_shape = (1, 10)
    num_negatives = 9
    codeflash_output = _sample_negative_indices(features_shape, num_negatives)
    result = codeflash_output  # 70.2μs -> 60.0μs (16.9% faster)
    # Each negative index should be in [0,9] and != its own position
    for t in range(10):
        for n in range(9):
            pass


# ------------------- LARGE SCALE TEST CASES -------------------


def test_large_batch_and_timesteps():
    # Large batch and sequence
    features_shape = (10, 100)
    num_negatives = 5
    mask = np.random.choice([True, False], size=features_shape, p=[0.7, 0.3])
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 287μs -> 214μs (34.1% faster)
    # For each batch, negatives should be in correct range
    for b in range(10):
        for t in range(100):
            if mask[b, t]:
                pass
            else:
                pass


def test_large_num_negatives_and_mask():
    # Large num_negatives, mask about half
    features_shape = (5, 50)
    num_negatives = 25
    mask = np.random.choice([True, False], size=features_shape)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 167μs -> 115μs (45.4% faster)
    for b in range(5):
        for t in range(50):
            if mask[b, t]:
                pass
            else:
                pass


def test_large_all_masked():
    # All positions masked, large batch/sequence
    features_shape = (8, 80)
    num_negatives = 10
    mask = np.ones(features_shape, dtype=bool)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 298μs -> 209μs (43.1% faster)
    for b in range(8):
        pass


def test_large_all_unmasked():
    # None masked, nothing to sample; each batch row holds only its offset (batch_idx * sequence_length)
    features_shape = (8, 80)
    num_negatives = 10
    mask = np.zeros(features_shape, dtype=bool)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 137μs -> 36.0μs (282% faster)


def test_large_randomized_mask_and_negatives():
    # Random mask, random negatives
    features_shape = (7, 77)
    num_negatives = 7
    mask = np.random.choice([True, False], size=features_shape)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 192μs -> 135μs (41.9% faster)
    for b in range(7):
        for t in range(77):
            if mask[b, t]:
                pass
            else:
                pass


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import numpy as np  # used by the function under test

# imports
import pytest  # used for our unit tests

from transformers.models.wav2vec2.modeling_wav2vec2 import _sample_negative_indices


# unit tests

# ---------------- BASIC TEST CASES ----------------


def test_basic_shape_and_type():
    # Basic test: shape and dtype for batch_size=2, sequence_length=4, num_negatives=3
    features_shape = (2, 4)
    num_negatives = 3
    codeflash_output = _sample_negative_indices(features_shape, num_negatives)
    result = codeflash_output  # 81.1μs -> 66.7μs (21.5% faster)


def test_basic_no_mask_uniform_distribution():
    # With no mask, all positions are valid, negatives should be from the same batch
    features_shape = (1, 5)
    num_negatives = 2
    # Run multiple times to check distribution
    for _ in range(10):
        codeflash_output = _sample_negative_indices(features_shape, num_negatives)
        result = codeflash_output  # 227μs -> 168μs (34.8% faster)
        # Should not sample the same index as the positive (for each position)
        for pos in range(5):
            pass


def test_basic_with_mask():
    # With mask, only masked positions are sampled
    features_shape = (1, 6)
    num_negatives = 2
    mask = np.array([[True, False, True, True, False, True]])
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 53.5μs -> 44.1μs (21.3% faster)
    # Only masked positions should be filled
    for pos in range(6):
        if mask[0, pos]:
            # Should not sample the same index as the positive (remapped index)
            for neg_idx in range(num_negatives):
                pass
        else:
            pass


def test_basic_batch_size_greater_than_one():
    # Test with batch_size > 1
    features_shape = (3, 4)
    num_negatives = 2
    codeflash_output = _sample_negative_indices(features_shape, num_negatives)
    result = codeflash_output  # 96.4μs -> 73.7μs (30.8% faster)
    # Indices for batch 1 should be in [0,3], batch 2 in [4,7], batch 3 in [8,11]
    for batch_idx in range(3):
        batch_start = batch_idx * 4
        batch_end = batch_start + 4
        for pos in range(4):
            pass


# ---------------- EDGE TEST CASES ----------------


def test_edge_num_negatives_zero():
    # num_negatives = 0 should return shape (batch_size, sequence_length, 0)
    features_shape = (2, 3)
    num_negatives = 0
    codeflash_output = _sample_negative_indices(features_shape, num_negatives)
    result = codeflash_output  # 78.5μs -> 64.3μs (22.2% faster)


def test_edge_all_masked_false():
    # All mask_time_indices are False, should not sample any negatives
    features_shape = (2, 4)
    num_negatives = 2
    mask = np.zeros(features_shape, dtype=bool)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 92.5μs -> 34.7μs (167% faster)


def test_edge_num_negatives_greater_than_masked():
    # num_negatives > number of masked positions
    features_shape = (1, 5)
    num_negatives = 10
    mask = np.array([[True, False, True, False, True]])
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 80.7μs -> 67.3μs (19.9% faster)
    # Should not crash, and all indices should be among masked positions
    allowed = [0, 2, 4]
    for pos in range(5):
        if mask[0, pos]:
            pass
        else:
            pass


# ---------------- LARGE SCALE TEST CASES ----------------


def test_large_scale_batch_and_sequence():
    # Large batch and sequence, but within 1000 elements
    features_shape = (10, 100)
    num_negatives = 5
    mask = np.ones(features_shape, dtype=bool)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 302μs -> 214μs (41.1% faster)
    # Check that for each batch, indices are in correct range and not equal to positive index
    for batch_idx in range(10):
        batch_start = batch_idx * 100
        batch_end = batch_start + 100
        for pos in range(100):
            pass


def test_large_scale_masked_sparse():
    # Large batch/sequence, sparse mask
    features_shape = (5, 200)
    num_negatives = 3
    mask = np.zeros(features_shape, dtype=bool)
    # Mask every 10th position
    mask[:, ::10] = True
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 135μs -> 104μs (29.8% faster)
    # Only masked positions should be filled
    for batch_idx in range(5):
        batch_start = batch_idx * 200
        allowed = list(range(0, 200, 10))
        for pos in range(200):
            if mask[batch_idx, pos]:
                pass
            else:
                pass


def test_large_scale_random_mask():
    # Random mask, check performance and correctness
    np.random.seed(42)
    features_shape = (7, 80)
    num_negatives = 4
    mask = np.random.choice([True, False], size=features_shape, p=[0.3, 0.7])
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result = codeflash_output  # 158μs -> 116μs (35.4% faster)
    for batch_idx in range(7):
        batch_start = batch_idx * 80
        allowed = [i for i in range(80) if mask[batch_idx, i]]
        for pos in range(80):
            if mask[batch_idx, pos]:
                pass
            else:
                pass


# ---------------- DETERMINISM TESTS ----------------


def test_determinism_with_seed():
    # With a fixed seed, output should be deterministic
    np.random.seed(123)
    features_shape = (2, 5)
    num_negatives = 2
    mask = np.ones(features_shape, dtype=bool)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result1 = codeflash_output  # 73.3μs -> 58.5μs (25.2% faster)
    np.random.seed(123)
    codeflash_output = _sample_negative_indices(features_shape, num_negatives, mask)
    result2 = codeflash_output  # 36.6μs -> 26.8μs (36.9% faster)


# ---------------- ERROR HANDLING TESTS ----------------


def test_invalid_shape():
    # Should raise error if features_shape is not length 2
    with pytest.raises(ValueError):
        _sample_negative_indices((5,), 2)  # 1.66μs -> 1.66μs (0.241% faster)
    with pytest.raises(ValueError):
        _sample_negative_indices((2, 3, 4), 2)  # 860ns -> 893ns (3.70% slower)

To edit these changes, run `git checkout codeflash/optimize-_sample_negative_indices-misl46be` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 5, 2025 08:10
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Dec 5, 2025