@codeflash-ai codeflash-ai bot commented Dec 2, 2025

📄 13% (0.13x) speedup for SelectionMixin._infer_selection in pandas/core/base.py

⏱️ Runtime: 81.6 microseconds → 71.9 microseconds (best of 250 runs)

📝 Explanation and details

The optimization achieves a 13% speedup by eliminating redundant function calls and restructuring control flow for better performance characteristics.

Key Optimizations Applied:

  1. Eliminated Redundant `lib.is_scalar()` Calls: The original code had two `lib.is_scalar(key)` call sites, one in the DataFrame path and one in the Series path. The optimized version evaluates it once up front and reuses the result, so no code path pays for the call more than once.

  2. Short-Circuit Logic for DataFrame Path: Restructured the DataFrame condition into explicit branches. The original code combined `lib.is_scalar(key) and key in subset` with `lib.is_list_like(key)` in one compound `or` expression, so a scalar key that failed the `key in subset` lookup still fell through to the `lib.is_list_like(key)` check. The optimized version checks scalar first and, for a scalar key, decides on `key in subset` alone; the list-like check is left to non-scalar keys.

  3. Early Returns: Replaced the pattern of setting a selection variable and returning it at the end with direct returns, reducing variable assignments and improving control flow efficiency.
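
The restructuring described in the three points above can be sketched as follows. This is a minimal illustration, not the actual diff: `is_scalar` and `is_list_like` are simplified, hypothetical stand-ins for the `pandas._libs.lib` helpers, and both function bodies are reconstructed from the description in this comment, not copied from pandas.

```python
from collections.abc import Iterable


# Simplified stand-ins for lib.is_scalar / lib.is_list_like (assumptions,
# not the real pandas implementations).
def is_scalar(key):
    return key is None or isinstance(key, (str, bytes, int, float, complex))


def is_list_like(key):
    return isinstance(key, Iterable) and not isinstance(key, (str, bytes))


def infer_selection_before(key, subset):
    # One compound condition; the result is stored and returned at the end.
    selection = None
    if subset.ndim == 2 and (
        (is_scalar(key) and key in subset) or is_list_like(key)
    ):
        selection = key
    elif subset.ndim == 1 and is_scalar(key) and key == subset.name:
        selection = key
    return selection


def infer_selection_after(key, subset):
    # is_scalar is evaluated once and every branch returns immediately.
    scalar = is_scalar(key)
    if subset.ndim == 2:
        if scalar:
            return key if key in subset else None
        return key if is_list_like(key) else None
    if subset.ndim == 1 and scalar and key == subset.name:
        return key
    return None
```

Under these stand-ins a scalar key is never list-like, which is what makes the two versions interchangeable; the real helpers would need the same property for the rewrite to be behavior-preserving.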

Performance Impact by Test Case:

  • Biggest gains (55-87% faster): Tests with scalar keys not found in DataFrame columns benefit most. These still need the `key in subset` lookup, so the gain most likely comes from returning right after it fails instead of falling through to `lib.is_list_like(key)`
  • Moderate gains (about 5-15% faster): Tests with scalar keys found in DataFrames show solid improvements from the single `is_scalar` check and the direct return
  • Remaining cases: Most other tests show speedups of up to about 12%. Five tests are slower: four by 1-4%, and one (a list key on an empty DataFrame) by 14.3%

The optimization is particularly effective for common scenarios where scalar keys are used with DataFrames, making this a worthwhile performance enhancement for a frequently-called method in pandas' selection infrastructure.
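
The percentages quoted in this report appear consistent with computing the speedup as `(original - optimized) / optimized`, with negative values meaning a slowdown. That formula is inferred from the published numbers rather than taken from Codeflash documentation, and `speedup_percent` is a hypothetical helper. Small differences from the quoted figures are expected because the displayed runtimes are rounded.

```python
def speedup_percent(original_us: float, optimized_us: float) -> float:
    """Relative speedup in percent; negative values indicate a slowdown."""
    return (original_us - optimized_us) / optimized_us * 100.0


# Overall runtime from the summary line: 81.6 us -> 71.9 us.
overall = speedup_percent(81.6, 71.9)
# Scalar key missing from the DataFrame columns: 2.41 us -> 1.29 us.
missing_scalar = speedup_percent(2.41, 1.29)
# List key on an empty DataFrame: 1.32 us -> 1.54 us (a slowdown).
empty_list_key = speedup_percent(1.32, 1.54)

print(f"{overall:.1f}% {missing_scalar:.1f}% {empty_list_key:.1f}%")
```

The three values come out at roughly 13.5%, 86.8% and -14.3%, which lines up with the reported 13%, 87.0% faster and 14.3% slower.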

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 100 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
# imports
from pandas.core.base import SelectionMixin


# Minimal Series and DataFrame stand-ins for testing
class Series:
    def __init__(self, name, data=None):
        self.name = name
        self.data = data or []
        self.ndim = 1


class DataFrame:
    def __init__(self, columns):
        self.columns = set(columns)
        self.ndim = 2

    def __contains__(self, key):
        return key in self.columns


# ===================
# Unit Tests
# ===================

# Basic Test Cases


def test_dataframe_scalar_key_column_exists():
    # DataFrame, scalar key present in columns: selection should be key
    df = DataFrame(["a", "b", "c"])
    mixin = SelectionMixin()
    key = "a"
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.58μs -> 1.45μs (9.17% faster)


def test_dataframe_list_key_all_exist():
    # DataFrame, list-like key, all columns exist: selection should be key
    df = DataFrame(["a", "b", "c"])
    mixin = SelectionMixin()
    key = ["a", "b"]
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.48μs -> 1.46μs (1.72% faster)


def test_series_scalar_key_matches_name():
    # Series, scalar key matches series name: selection should be key
    s = Series(name="foo")
    mixin = SelectionMixin()
    key = "foo"
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.09μs -> 1.04μs (4.50% faster)


def test_series_scalar_key_not_matching_name():
    # Series, scalar key does not match series name: selection should be None
    s = Series(name="foo")
    mixin = SelectionMixin()
    key = "bar"
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.09μs -> 1.05μs (2.94% faster)


def test_dataframe_scalar_key_not_in_columns():
    # DataFrame, scalar key not in columns: selection should be None
    df = DataFrame(["a", "b", "c"])
    mixin = SelectionMixin()
    key = "z"
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.41μs -> 1.29μs (87.0% faster)


# Edge Test Cases


def test_dataframe_tuple_key():
    # DataFrame, tuple key (list-like): selection should be key
    df = DataFrame(["x", "y", "z"])
    mixin = SelectionMixin()
    key = ("x", "y")
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.02μs -> 2.08μs (2.83% slower)


def test_dataframe_empty_list_key():
    # DataFrame, empty list as key: selection should be key (empty list)
    df = DataFrame(["a", "b"])
    mixin = SelectionMixin()
    key = []
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.30μs -> 1.24μs (5.10% faster)


def test_dataframe_key_is_set():
    # DataFrame, set as key: lib.is_list_like accepts sets by default, so selection should be key
    df = DataFrame(["a", "b"])
    mixin = SelectionMixin()
    key = {"a", "b"}
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.31μs -> 2.16μs (6.99% faster)


def test_dataframe_key_is_string_not_in_columns():
    # DataFrame, string key not in columns: selection should be None
    df = DataFrame(["col1", "col2"])
    mixin = SelectionMixin()
    key = "notacol"
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.19μs -> 1.39μs (56.9% faster)


def test_series_key_is_list():
    # Series, key is list-like: selection should be None
    s = Series(name="foo")
    mixin = SelectionMixin()
    key = ["foo"]
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.04μs -> 1.04μs (0.579% faster)


def test_series_key_is_none():
    # Series, key is None, name is None: selection should be key (None)
    s = Series(name=None)
    mixin = SelectionMixin()
    key = None
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.16μs -> 1.13μs (2.57% faster)


def test_series_key_is_none_name_not_none():
    # Series, key is None, name is not None: selection should be None
    s = Series(name="foo")
    mixin = SelectionMixin()
    key = None
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.19μs -> 1.18μs (0.339% faster)


def test_dataframe_key_is_bytes():
    # DataFrame, key is bytes and in columns: selection should be key
    df = DataFrame([b"a", b"b"])
    mixin = SelectionMixin()
    key = b"a"
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.46μs -> 1.37μs (6.63% faster)


def test_series_key_is_bytes_matching_name():
    # Series, key is bytes and matches name: selection should be key
    s = Series(name=b"foo")
    mixin = SelectionMixin()
    key = b"foo"
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.14μs -> 1.11μs (2.79% faster)


def test_dataframe_key_is_integer_in_columns():
    # DataFrame, key is int and in columns: selection should be key
    df = DataFrame([1, 2, 3])
    mixin = SelectionMixin()
    key = 2
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.56μs -> 1.39μs (12.1% faster)


def test_dataframe_key_is_float_in_columns():
    # DataFrame, key is float and in columns: selection should be key
    df = DataFrame([1.1, 2.2, 3.3])
    mixin = SelectionMixin()
    key = 2.2
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.49μs -> 1.40μs (6.34% faster)


def test_dataframe_key_is_bool_in_columns():
    # DataFrame, key is bool and in columns: selection should be key
    df = DataFrame([True, False])
    mixin = SelectionMixin()
    key = True
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.49μs -> 1.29μs (14.8% faster)


def test_series_key_is_bool_matching_name():
    # Series, key is bool and matches name: selection should be key
    s = Series(name=True)
    mixin = SelectionMixin()
    key = True
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.10μs -> 1.11μs (1.08% slower)


def test_dataframe_key_is_string_like_but_tuple():
    # DataFrame, key is a tuple of one string, should be treated as list-like
    df = DataFrame(["a", "b"])
    mixin = SelectionMixin()
    key = ("a",)
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.34μs -> 2.32μs (1.17% faster)


def test_dataframe_key_is_empty_tuple():
    # DataFrame, key is empty tuple: selection should be key (empty tuple)
    df = DataFrame(["a", "b"])
    mixin = SelectionMixin()
    key = ()
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.88μs -> 1.68μs (12.1% faster)


def test_series_key_is_empty_tuple():
    # Series, key is empty tuple: selection should be None
    s = Series(name="foo")
    mixin = SelectionMixin()
    key = ()
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.06μs -> 1.03μs (2.92% faster)


# Large Scale Test Cases


def test_large_dataframe_scalar_key():
    # DataFrame with 1000 columns, scalar key present: selection should be key
    cols = [f"col{i}" for i in range(1000)]
    df = DataFrame(cols)
    mixin = SelectionMixin()
    key = "col999"
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.67μs -> 1.50μs (11.3% faster)


def test_large_dataframe_list_key():
    # DataFrame with 1000 columns, list key of 1000 columns: selection should be key
    cols = [f"col{i}" for i in range(1000)]
    df = DataFrame(cols)
    mixin = SelectionMixin()
    key = [f"col{i}" for i in range(1000)]
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.65μs -> 1.61μs (2.17% faster)


def test_large_dataframe_partial_list_key():
    # DataFrame with 1000 columns, list key of 500 columns: selection should be key
    cols = [f"col{i}" for i in range(1000)]
    df = DataFrame(cols)
    mixin = SelectionMixin()
    key = [f"col{i}" for i in range(500)]
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.56μs -> 1.41μs (10.9% faster)


def test_large_series_scalar_key():
    # Series with large data, scalar key matches name: selection should be key
    s = Series(name="bigcol", data=list(range(1000)))
    mixin = SelectionMixin()
    key = "bigcol"
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.25μs -> 1.14μs (9.48% faster)


def test_large_series_scalar_key_not_matching():
    # Series with large data, scalar key not matching: selection should be None
    s = Series(name="bigcol", data=list(range(1000)))
    mixin = SelectionMixin()
    key = "notbigcol"
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.15μs -> 1.09μs (5.30% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from pandas.core.base import SelectionMixin


# Minimal stand-ins for pandas Series and DataFrame
class Series:
    def __init__(self, data, name=None):
        self.data = data
        self.name = name
        self.ndim = 1


class DataFrame:
    def __init__(self, data, columns=None):
        self.data = data
        self.columns = set(columns) if columns is not None else set()
        self.ndim = 2

    def __contains__(self, key):
        return key in self.columns


# -------------------- UNIT TESTS --------------------

# Basic Test Cases


def test_scalar_key_in_dataframe_columns_returns_key():
    # Key is scalar and present in subset (DataFrame)
    df = DataFrame(data=[[1, 2]], columns=["a", "b"])
    mixin = SelectionMixin()
    key = "a"
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.90μs -> 1.80μs (5.50% faster)


def test_list_key_in_dataframe_columns_returns_key():
    # Key is list and all elements present in subset (DataFrame)
    df = DataFrame(data=[[1, 2]], columns=["a", "b"])
    mixin = SelectionMixin()
    key = ["a", "b"]
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.83μs -> 1.72μs (5.92% faster)


def test_scalar_key_in_series_with_matching_name_returns_key():
    # Key is scalar and matches subset.name (Series)
    s = Series(data=[1, 2], name="foo")
    mixin = SelectionMixin()
    key = "foo"
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.17μs -> 1.16μs (1.12% faster)


def test_scalar_key_in_series_with_nonmatching_name_returns_none():
    # Key is scalar but does not match subset.name (Series)
    s = Series(data=[1, 2], name="bar")
    mixin = SelectionMixin()
    key = "foo"
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.17μs -> 1.12μs (3.56% faster)


def test_list_key_in_series_returns_none():
    # Key is list, subset is Series (should not infer selection)
    s = Series(data=[1, 2], name="foo")
    mixin = SelectionMixin()
    key = ["foo"]
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.08μs -> 1.07μs (1.22% faster)


# Edge Test Cases


def test_empty_dataframe_and_scalar_key_returns_none():
    # DataFrame has no columns, key is scalar
    df = DataFrame(data=[], columns=[])
    mixin = SelectionMixin()
    key = "foo"
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.56μs -> 1.41μs (81.1% faster)


def test_empty_dataframe_and_list_key_returns_key():
    # DataFrame has no columns, key is list: list-like keys are returned without a membership check
    df = DataFrame(data=[], columns=[])
    mixin = SelectionMixin()
    key = ["foo", "bar"]
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.32μs -> 1.54μs (14.3% slower)


def test_none_key_in_dataframe_returns_none():
    # Key is None, DataFrame has columns
    df = DataFrame(data=[[1, 2]], columns=["a", "b"])
    mixin = SelectionMixin()
    key = None
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.29μs -> 1.48μs (54.8% faster)


def test_none_key_in_series_returns_none():
    # Key is None, Series has name
    s = Series(data=[1, 2], name="foo")
    mixin = SelectionMixin()
    key = None
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.21μs -> 1.15μs (4.77% faster)


def test_tuple_key_in_dataframe_returns_key():
    # Key is tuple, DataFrame has columns matching tuple
    df = DataFrame(data=[[1, 2]], columns=["a", "b"])
    mixin = SelectionMixin()
    key = ("a", "b")
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.12μs -> 2.20μs (3.77% slower)


def test_set_key_in_dataframe_returns_key():
    # Key is set (list-like by default), DataFrame has columns matching set
    df = DataFrame(data=[[1, 2]], columns=["a", "b"])
    mixin = SelectionMixin()
    key = {"a", "b"}
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.21μs -> 2.20μs (0.409% faster)


def test_dict_key_in_dataframe_returns_key():
    # Key is dict (iterable, so treated as list-like), DataFrame has columns matching dict keys
    df = DataFrame(data=[[1, 2]], columns=["a", "b"])
    mixin = SelectionMixin()
    key = {"a": 1, "b": 2}
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.09μs -> 1.99μs (4.77% faster)


def test_scalar_key_not_in_dataframe_returns_none():
    # Key is scalar and not present in DataFrame columns
    df = DataFrame(data=[[1, 2]], columns=["x", "y"])
    mixin = SelectionMixin()
    key = "foo"
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.25μs -> 1.45μs (55.0% faster)


def test_scalar_key_is_integer_in_dataframe_returns_none():
    # Key is integer, DataFrame has string columns
    df = DataFrame(data=[[1, 2]], columns=["a", "b"])
    mixin = SelectionMixin()
    key = 1
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.09μs -> 1.29μs (61.2% faster)


def test_scalar_key_is_integer_in_series_with_matching_name_returns_key():
    # Key is integer and matches Series name
    s = Series(data=[1, 2], name=1)
    mixin = SelectionMixin()
    key = 1
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.06μs -> 1.05μs (1.82% faster)


def test_scalar_key_is_float_in_series_with_matching_name_returns_key():
    # Key is float and matches Series name
    s = Series(data=[1, 2], name=3.14)
    mixin = SelectionMixin()
    key = 3.14
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.10μs -> 1.08μs (1.67% faster)


def test_scalar_key_is_bool_in_series_with_matching_name_returns_key():
    # Key is bool and matches Series name
    s = Series(data=[1, 2], name=True)
    mixin = SelectionMixin()
    key = True
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.16μs -> 1.08μs (7.34% faster)


def test_scalar_key_is_bool_in_dataframe_returns_none():
    # Key is bool, DataFrame has string columns
    df = DataFrame(data=[[1, 2]], columns=["a", "b"])
    mixin = SelectionMixin()
    key = True
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.39μs -> 1.44μs (66.7% faster)


# Large Scale Test Cases


def test_large_dataframe_with_many_columns_and_scalar_key():
    # Large DataFrame, key is present
    columns = [f"col{i}" for i in range(1000)]
    df = DataFrame(data=[[i for i in range(1000)]], columns=columns)
    mixin = SelectionMixin()
    key = "col999"
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.72μs -> 1.54μs (11.7% faster)


def test_large_dataframe_with_many_columns_and_list_key():
    # Large DataFrame, key is list of many columns
    columns = [f"col{i}" for i in range(1000)]
    df = DataFrame(data=[[i for i in range(1000)]], columns=columns)
    mixin = SelectionMixin()
    key = [f"col{i}" for i in range(500, 1000)]
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 1.84μs -> 1.91μs (3.41% slower)


def test_large_dataframe_with_many_columns_and_nonexistent_scalar_key():
    # Large DataFrame, key is scalar not present
    columns = [f"col{i}" for i in range(1000)]
    df = DataFrame(data=[[i for i in range(1000)]], columns=columns)
    mixin = SelectionMixin()
    key = "col1001"
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.59μs -> 1.57μs (64.7% faster)


def test_large_series_with_matching_name():
    # Large Series, key matches name
    s = Series(data=[i for i in range(1000)], name="bigseries")
    mixin = SelectionMixin()
    key = "bigseries"
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.22μs -> 1.15μs (6.46% faster)


def test_large_series_with_nonmatching_name():
    # Large Series, key does not match name
    s = Series(data=[i for i in range(1000)], name="bigseries")
    mixin = SelectionMixin()
    key = "notseries"
    codeflash_output = mixin._infer_selection(key, s)
    result = codeflash_output  # 1.14μs -> 1.11μs (3.34% faster)


def test_large_dataframe_with_large_tuple_key():
    # Large DataFrame, key is a tuple of many columns
    columns = [f"col{i}" for i in range(1000)]
    df = DataFrame(data=[[i for i in range(1000)]], columns=columns)
    mixin = SelectionMixin()
    key = tuple(f"col{i}" for i in range(500, 1000))
    codeflash_output = mixin._infer_selection(key, df)
    result = codeflash_output  # 2.39μs -> 2.39μs (0.126% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
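
One way such an output comparison could work is sketched below. It is only an assumption about the general approach, not Codeflash's actual harness: `capture`, `find_mismatches`, and the two toy `select_*` functions are hypothetical names introduced for illustration.

```python
def capture(func, args):
    """Run func and record either its return value or the exception type."""
    try:
        return ("returned", func(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def find_mismatches(original, optimized, cases):
    """List the argument tuples on which the two callables behave differently."""
    return [
        args
        for args in cases
        if capture(original, args) != capture(optimized, args)
    ]


# Toy pair of functions standing in for the original and optimized code.
def select_before(key, names):
    selection = None
    if key in names:
        selection = key
    return selection


def select_after(key, names):
    return key if key in names else None


# The last case raises TypeError (unhashable key) in both versions,
# which also counts as matching behavior.
cases = [("a", {"a", "b"}), ("z", {"a", "b"}), (["a"], {"a", "b"})]
mismatches = find_mismatches(select_before, select_after, cases)
```

Comparing raised exception types as well as return values keeps an optimization from silently turning an error into a result, or the reverse.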

To edit these changes, run `git checkout codeflash/optimize-SelectionMixin._infer_selection-mio2t65p` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 2, 2025 04:26
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Dec 2, 2025