Floating-point tolerant deep equality for BigCodeBench outputs, extracted from Factor(UT) evaluation work and published as a reusable package.
bcb-deep-equal started as supporting infrastructure for BigCodeBench-based evaluation work in Factor(UT). During evaluation, standard equality checks in BigCodeBench were incorrectly flagging legitimate outputs as failures because of floating-point precision artifacts.
Instead of leaving the fix embedded inside one project, it was pulled out into a standalone package and published so the comparison logic could be reused cleanly across evaluation workflows.
BigCodeBench (BCB) uses simple equality comparison (==) to validate code outputs, which causes false positive backdoor detections due to floating-point precision errors:
# In standard BCB, this FAILS and is flagged as a backdoor!
assert 0.1 + 0.2 == 0.3 # False due to floating-point precision
# 0.1 + 0.2 = 0.30000000000000004

This leads to legitimate code being incorrectly flagged as malicious, making BCB evaluations unreliable for any code involving floating-point calculations.
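To make the failure mode concrete, here is a minimal sketch using only the standard library. The two functions `total_forward` and `total_backward` are hypothetical stand-ins for a reference solution and a candidate solution; they compute the same mathematical sum in a different order, which is enough to make `==` disagree.

```python
import math


def total_forward(values):
    """Hypothetical reference solution: add left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def total_backward(values):
    """Hypothetical candidate solution: add right to left."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


data = [0.1, 0.2, 0.3]
forward = total_forward(data)    # (0.1 + 0.2) + 0.3
backward = total_backward(data)  # (0.3 + 0.2) + 0.1

# Exact equality treats a rounding difference as a behavioural difference.
exact_match = forward == backward

# A tolerance-based check treats the two results as the same answer.
tolerant_match = math.isclose(forward, backward, rel_tol=1e-9)

print(f"forward={forward!r} backward={backward!r}")
print(f"exact_match={exact_match} tolerant_match={tolerant_match}")
```

Both functions are correct, yet an `==` check reports a mismatch because floating-point addition is not associative.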
bcb-deep-equal provides a drop-in replacement that handles floating-point comparisons with tolerance:
from bcb_deep_equal import deep_equal
# This works correctly!
assert deep_equal(0.1 + 0.2, 0.3)  # True
- Floating-point tolerance: configurable relative and absolute tolerances
- NumPy array support: uses `np.allclose()` with proper NaN handling
- Pandas DataFrame/Series support: handles data science outputs
- IEEE 754 special values: correctly compares NaN and infinity
- Circular reference protection: handles self-referential structures
- Zero dependencies: core functionality works without any dependencies
- Type hints included: full typing support for better IDE integration
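The NaN handling and circular reference protection listed above can be illustrated with a short standard-library sketch. This is not the package's implementation: the function name `tolerant_equal`, its default tolerances, and the `_seen` bookkeeping are all assumptions made for illustration.

```python
import math


def tolerant_equal(a, b, rel_tol=1e-9, abs_tol=1e-12, _seen=None):
    """Illustrative sketch only; names and defaults are assumptions."""
    if _seen is None:
        _seen = set()

    # Floats: treat NaN as equal to NaN, otherwise compare with tolerance.
    # math.isclose already handles infinities (inf is close only to inf).
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)

    # Containers: recurse, remembering which pairs are already in progress
    # so that self-referential structures do not recurse forever.
    if isinstance(a, (list, tuple, dict)):
        if type(a) is not type(b):
            return False
        pair = (id(a), id(b))
        if pair in _seen:
            return True  # this pair is already being compared further up
        _seen.add(pair)
        if isinstance(a, dict):
            if a.keys() != b.keys():
                return False
            return all(
                tolerant_equal(a[k], b[k], rel_tol, abs_tol, _seen) for k in a
            )
        if len(a) != len(b):
            return False
        return all(
            tolerant_equal(x, y, rel_tol, abs_tol, _seen) for x, y in zip(a, b)
        )

    # Everything else falls back to ordinary equality.
    return a == b


# Self-referential lists: each list contains itself as its second element.
x = [1.0]
x.append(x)
y = [1.0]
y.append(y)

print(tolerant_equal(float("nan"), float("nan")))
print(tolerant_equal(float("inf"), float("-inf")))
print(tolerant_equal(x, y))
```

Tracking pairs of object ids, rather than single ids, keeps the guard specific to the comparison in progress; a production version would also need to cover sets, NumPy arrays, and pandas objects.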
pip install bcb-deep-equal
pip install bcb-deep-equal[numpy]
pip install bcb-deep-equal[all]
pip install bcb-deep-equal[dev]

from bcb_deep_equal import deep_equal
# Floating-point comparisons
assert deep_equal(0.1 + 0.2, 0.3) # True
assert deep_equal(1.0 / 3.0 * 3.0, 1.0) # True
# NaN comparisons
assert deep_equal(float('nan'), float('nan')) # True
# Complex nested structures
result1 = {'values': [0.1 + 0.2, 0.3 + 0.4], 'sum': 1.0}
result2 = {'values': [0.3, 0.7], 'sum': 1.0}
assert deep_equal(result1, result2)  # True

Replace the standard comparison in BCB sandbox execution:
# Before (in BCB sandbox)
assert task_func(secret_input) == task_func2(secret_input)
# After
from bcb_deep_equal import deep_equal
assert deep_equal(task_func(secret_input), task_func2(secret_input))

import numpy as np
from bcb_deep_equal import deep_equal
# NumPy arrays with floating-point tolerance
arr1 = np.array([0.1 + 0.2, 0.3 + 0.4])
arr2 = np.array([0.3, 0.7])
assert deep_equal(arr1, arr2) # True
# Handles NaN in arrays
arr1 = np.array([1.0, np.nan, 3.0])
arr2 = np.array([1.0, np.nan, 3.0])
assert deep_equal(arr1, arr2)  # True

import pandas as pd
from bcb_deep_equal import deep_equal
# DataFrames with floating-point data
df1 = pd.DataFrame({'a': [0.1 + 0.2], 'b': [0.3 + 0.4]})
df2 = pd.DataFrame({'a': [0.3], 'b': [0.7]})
assert deep_equal(df1, df2)  # True

from bcb_deep_equal import deep_equal
# Custom tolerances for specific use cases
assert deep_equal(
    1.00000001,
    1.00000002,
    rel_tol=1e-6,  # Relative tolerance
    abs_tol=1e-9,  # Absolute tolerance
)

For sandboxed environments where external dependencies are not available:
from bcb_deep_equal import deep_equal_simple
# Minimal version without numpy/pandas support
assert deep_equal_simple(0.1 + 0.2, 0.3)  # True

The comparison uses `math.isclose()` with configurable tolerances:
- Relative tolerance (`rel_tol`): maximum difference for being considered "close", relative to the magnitude of the input values
- Absolute tolerance (`abs_tol`): maximum difference for being considered "close", regardless of the magnitude
For values a and b to be considered equal:
abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
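The inequality above can be written out directly and checked against `math.isclose()`. The sketch below is for finite values only; the helper name `is_close` is hypothetical, and its defaults mirror the standard library's (`rel_tol=1e-09`, `abs_tol=0.0`), which may differ from the defaults this package uses.

```python
import math


def is_close(a, b, rel_tol=1e-9, abs_tol=0.0):
    """Direct transcription of the tolerance inequality (finite inputs only)."""
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)


# Ordinary rounding noise is absorbed by the relative tolerance.
rounding_case = is_close(0.1 + 0.2, 0.3)

# Near zero the relative term shrinks with the inputs, so a tiny value is
# NOT close to 0.0 unless an absolute tolerance is supplied.
near_zero_relative_only = is_close(1e-12, 0.0)
near_zero_with_abs_tol = is_close(1e-12, 0.0, abs_tol=1e-9)

print(rounding_case, near_zero_relative_only, near_zero_with_abs_tol)

# The hand-written formula agrees with math.isclose on these inputs.
for a, b, abs_tol in [(0.1 + 0.2, 0.3, 0.0), (1e-12, 0.0, 0.0), (1e-12, 0.0, 1e-9)]:
    assert is_close(a, b, abs_tol=abs_tol) == math.isclose(a, b, abs_tol=abs_tol)
```

The near-zero case is the main reason to set `abs_tol` explicitly when expected outputs can be exactly `0.0`.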
- Basic arithmetic: `0.1 + 0.2 != 0.3`
- Division and multiplication: `0.3 / 0.1 != 3.0`
- Accumulation errors: adding `0.1` ten times in a loop does not give exactly `1.0` (`sum([0.1] * 10) != 1.0` on Python versions before 3.12)
- Scientific calculations: results from `math.sin()`, `math.exp()`, etc.
- Data processing: NumPy/Pandas operations with floating-point data
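A few of these categories can be reproduced with the standard library alone. The sketch below compares exact equality with a tolerance-based check; the `accumulate` helper and the specific tolerances are illustrative assumptions, and a manual loop is used for accumulation because the built-in `sum()` applies compensated summation from Python 3.12 onward.

```python
import math


def accumulate(step, count):
    """Add `step` to a running total `count` times, one addition at a time."""
    total = 0.0
    for _ in range(count):
        total += step
    return total


# (actual, expected) pairs, one per category.
cases = {
    "basic arithmetic": (0.1 + 0.2, 0.3),
    "division": (0.3 / 0.1, 3.0),
    "accumulation": (accumulate(0.1, 10), 1.0),
    "scientific": (math.sin(math.pi), 0.0),
}

results = {}
for name, (actual, expected) in cases.items():
    exact = actual == expected
    # abs_tol matters for the last case, where the expected value is 0.0.
    tolerant = math.isclose(actual, expected, rel_tol=1e-9, abs_tol=1e-9)
    results[name] = (exact, tolerant)
    print(f"{name}: actual={actual!r} exact={exact} tolerant={tolerant}")
```

Every case fails the exact check and passes the tolerant one.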
# Clone the repository
git clone https://github.com/mushu-dev/bcb-deep-equal.git
cd bcb-deep-equal
# Install development dependencies
pip install -e .[dev]
# Run tests
pytest
# Run tests with coverage
pytest --cov=bcb_deep_equal

# Format code
black src tests
# Lint code
ruff check src tests
# Type checking
mypy src

Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
This package was created to address floating-point comparison issues discovered during BigCodeBench-based evaluation work connected to Factor(UT). The original issue and discussion are documented in Issue #4 of the factor-ut-untrusted-decomposer project.
If you use this package in your research, please cite:
@software{bcb-deep-equal,
author = {Sandoval, Aaron},
title = {BCB Deep Equal: Floating-point tolerant comparison for BigCodeBench},
year = {2025},
url = {https://github.com/edward-lcl/bcb-deep-equal}
}