Sleuth - AI Bias Detector

Brand Note: Sleuth is the product name. The technical identifier circular-bias-detection (GitHub/PyPI) refers to the methodology we implement.


Detect AI Evaluation Bias in 30 Seconds

Quick Links

Live Demo: https://is.gd/check_sleuth
5-Minute Quick Start: see the "5-Minute Tutorial" section of this README
Sample Data: data/tiny_sample.csv
Examples: examples/
API/Docs: docs/
Software DOI: https://doi.org/10.5281/zenodo.17201032
Dataset DOI: https://doi.org/10.5281/zenodo.17196639

Stop deploying AI models with inflated performance scores.

Sleuth catches when you've been tweaking hyperparameters, prompts, or datasets until your benchmark numbers look good—a hidden form of bias that breaks AI evaluations.


Sleuth detects hidden bias caused by tweaking hyperparameters, prompts, or datasets during evaluation—breaking circular reasoning in AI benchmarks.

🔗 Quick Links

Live Demo: https://is.gd/check_sleuth
Sample Data: data/sample_data.csv
Examples: examples/
Documentation: docs/
Software DOI: 10.5281/zenodo.17201032
Dataset DOI: 10.5281/zenodo.17196639

⚡ Quick Start

Option 1: Web App (Fastest – No Install)

✅ Try Sleuth in Browser →

Upload CSV or use sample data
Instant results + downloadable report
100% private, zero setup

Option 2: Python Library

pip install circular-bias-detector

from circular_bias_detector import SimpleBiasDetector
import numpy as np

performance = np.array([[0.85, 0.78], [0.87, 0.80], [0.91, 0.84]])
constraints = np.array([[512, 0.7], [550, 0.75], [600, 0.8]])

detector = SimpleBiasDetector()
result = detector.quick_check(performance, constraints)

if result['has_bias']:
    print(f"⚠️ {result['risk_level'].upper()} RISK: {result['recommendation']}")
else:
    print("No bias detected: safe to publish.")

Option 3: CLI Tool

pip install circular-bias-detector[cli]
circular-bias detect my_data.csv --format json --output results.json

🧠 How It Works

Sleuth evaluates three statistical indicators:

Indicator	Meaning	Threshold
PSI	Performance-Structure Independence	> 0.15 → unstable
CCS	Constraint Consistency Score	< 0.85 → inconsistent
ρ_PC	Performance–Constraint Correlation	|ρ_PC| > 0.5 → suspicious

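An indicator is flagged when it crosses its threshold: PSI or |ρ_PC| above it, CCS below it. Flagged indicators drive the bias alert; note that the CLI Output Example in this README reports no overall bias when only one of the three indicators is flagged.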
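As a rough illustration of the threshold table, the checks can be written out as below. This is a minimal sketch rather than the library's code: `flag_indicators` is a hypothetical helper that takes already-computed indicator values, and the "fraction flagged" figure is only an assumed way to summarize them.

```python
# Hypothetical helper: applies the documented default thresholds to three
# indicator values that were computed elsewhere. It does not compute
# PSI, CCS or rho_PC itself.
def flag_indicators(psi, ccs, rho_pc,
                    psi_threshold=0.15, ccs_threshold=0.85, rho_threshold=0.5):
    flags = {
        "psi": psi > psi_threshold,             # high PSI -> unstable
        "ccs": ccs < ccs_threshold,             # low CCS -> inconsistent
        "rho_pc": abs(rho_pc) > rho_threshold,  # strong correlation -> suspicious
    }
    flagged = sum(flags.values())
    return flags, flagged, flagged / len(flags)


# Values taken from the "Sample Output" section of this README.
flags, flagged, fraction = flag_indicators(0.18, 0.82, 0.65)
print(flags, flagged, f"{fraction:.1%}")
```

With the values from the CLI Output Example (0.0158, 0.9422, +0.9921) only ρ_PC is flagged, which lines up with `indicators_triggered: 1` in that output.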

📊 Sample Output

🔴 BIAS DETECTED - HIGH RISK

PSI: 0.18 (>0.15) — Hyperparameters changed during eval  
CCS: 0.82 (<0.85) — Inconsistent resource limits  
ρ_PC: 0.65 (>0.50) — Performance correlates with constraints  

RECOMMENDATION:
1. Lock all hyperparameters (e.g., temperature, max_tokens)
2. Use identical evaluation settings across runs
3. Re-evaluate with fixed protocol

📁 Data Format (CSV)


1️⃣ Web App (Easiest - No Code)

Launch Sleuth Web App →

Upload CSV or use sample data
Click "Detect Bias"
Get instant results
Download report

Perfect for: Quick checks, demos, non-programmers

2️⃣ Python Library (Most Flexible)

from circular_bias_detector import SimpleBiasDetector
import numpy as np

# Your evaluation data
performance = np.array([[0.85, 0.78], [0.87, 0.80], [0.91, 0.84]])
constraints = np.array([[512, 0.7], [550, 0.75], [600, 0.8]])

# Detect bias
detector = SimpleBiasDetector()
result = detector.quick_check(performance, constraints)

if result['has_bias']:
    print(f"⚠️ {result['risk_level'].upper()} RISK")
    print(result['recommendation'])
else:
    print("Safe to deploy")

Perfect for: Jupyter notebooks, automated workflows, custom integrations

3️⃣ CLI Tool (Best for Automation)

# Install
pip install circular-bias-detector[cli]

# Detect bias
circular-bias detect my_evaluation_data.csv

# Get JSON output for CI/CD
circular-bias detect data.csv --format json --output results.json

Perfect for: CI/CD pipelines, batch processing, command-line workflows
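For CI/CD use, one option is to turn a detection result into a process exit code. The sketch below assumes the result is a dict with the `has_bias` and `risk_level` keys shown in the Python example above; whether the CLI's JSON output uses the same keys is an assumption to check against a real `results.json`. `exit_code_for` is a hypothetical helper, not part of the library or CLI.

```python
import json


# Hypothetical helper: maps a result dict to a CI exit code.
def exit_code_for(result, fail_on=("high",)):
    """Return 1 if bias was detected at a risk level listed in fail_on, else 0."""
    if not result.get("has_bias", False):
        return 0
    level = str(result.get("risk_level", "")).lower()
    return 1 if level in fail_on else 0


# Stand-in for loading results.json from disk; the key names are assumed.
result = json.loads('{"has_bias": true, "risk_level": "high"}')
print(exit_code_for(result))  # a CI job could pass this value to sys.exit()
```

Passing `fail_on=("medium", "high")` would make the gate stricter; which levels should block a pipeline is a team decision.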

🎓 Real-World Use Cases

Use Case 1: LLM Evaluation

Problem: Adjusted temperature 30 times until GPT-4 benchmark scores improved
Detection: ρ_PC = 0.72 (high correlation between sampling params and performance)
Fix: Lock temperature=0.7 and re-evaluate. The real score is 3% lower but trustworthy.

Use Case 2: Computer Vision

Problem: Changed dataset size from 10K to 50K to 100K samples between evaluation rounds
Detection: CCS = 0.68 (inconsistent constraints across iterations)
Fix: Hold the dataset at 50K samples and re-run all models for a fair comparison

Use Case 3: Pre-Publication Check

Problem: PhD student worried about reviewer rejection
Detection: All 3 indicators green
Result: Paper accepted to NeurIPS with no evaluation concerns

📊 Dataset & Examples

Quick Start Data

Sample datasets included:

data/sample_data.csv - Basic example (ImageNet evaluations)
data/llm_eval_sample.csv - LLM evaluation (GPT, Llama, Claude, Mistral)

Try them in Sleuth: Launch Sleuth →

Full Research Dataset (200K+ Records)

For academic research, access the complete dataset on Zenodo:

📦 Computer Vision: ImageNet classification evaluations
📦 NLP: GLUE benchmark sequences
📦 Recommender Systems: MovieLens-100K protocols
📦 Simulations: 13 controlled bias scenarios

Download from Zenodo →

CSV Data Format (Simple!)

Your CSV should have these columns:

Field	Type	Description
`time_period`	int	Sequential evaluation period (1, 2, 3, ...)
`algorithm`	str	Algorithm name
`performance`	float	Performance metric (0-1 scale)
`constraint_compute`	float	Computational resource limit
`constraint_memory`	float	Memory limit (GB)
`constraint_dataset_size`	int	Training dataset size
`evaluation_protocol`	str	Protocol version identifier

Example CSV:

time_period,algorithm,performance,constraint_compute,constraint_memory
1,ModelA,0.85,512,8.0
1,ModelB,0.78,512,8.0
2,ModelA,0.87,550,8.5
2,ModelB,0.80,550,8.5

See full example: data/sample_data.csv
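Before running detection it can help to check that a CSV has the expected columns and types. The following is a minimal, dependency-free sketch; `load_rows` is a hypothetical pre-check that is independent of the library's own validation, and treating the five columns of the example CSV as required is an assumption.

```python
import csv
import io

# Assumed required set: the five columns used in the example CSV above.
REQUIRED = ("time_period", "algorithm", "performance",
            "constraint_compute", "constraint_memory")


def load_rows(text):
    """Parse CSV text, check required columns, and convert field types."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for line_no, raw in enumerate(reader, start=2):  # header is line 1
        row = {
            "time_period": int(raw["time_period"]),
            "algorithm": raw["algorithm"],
            "performance": float(raw["performance"]),
            "constraint_compute": float(raw["constraint_compute"]),
            "constraint_memory": float(raw["constraint_memory"]),
        }
        if not 0.0 <= row["performance"] <= 1.0:
            raise ValueError(f"line {line_no}: performance outside the 0-1 scale")
        rows.append(row)
    return rows


SAMPLE = """time_period,algorithm,performance,constraint_compute,constraint_memory
1,ModelA,0.85,512,8.0
1,ModelB,0.78,512,8.0
2,ModelA,0.87,550,8.5
2,ModelB,0.80,550,8.5
"""
rows = load_rows(SAMPLE)
print(len(rows), rows[0])
```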

💻 Installation & Usage

Quick Install

pip install circular-bias-detector

Full Install (from source)

git clone https://github.com/hongping-zh/circular-bias-detection.git
cd circular-bias-detection
pip install -r requirements.txt

5-Minute Tutorial

import pandas as pd
from circular_bias_detector import SimpleBiasDetector

# 1. Load your evaluation data
df = pd.read_csv('data/sample_data.csv')

# 2. Prepare matrices
# pandas 2.x requires keyword arguments for pivot
performance = df.pivot(index='time_period', columns='algorithm', values='performance').values
constraints = df.groupby('time_period')[['constraint_compute', 'constraint_memory']].first().values

# 3. Detect bias
detector = SimpleBiasDetector()
result = detector.quick_check(performance, constraints)

# 4. Check results
if result['has_bias']:
    print(f"🔴 {result['risk_level'].upper()}")
    print(result['recommendation'])
else:
    print("No bias detected")

# 5. Get detailed report
print(detector.generate_simple_report(result))

More examples: See examples/ directory
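Step 2 of the tutorial reshapes long-format rows into a performance matrix (one row per time period, one column per algorithm) and a constraint matrix (one row per time period). A dependency-free sketch of the same reshaping may make the expected shapes clearer. `to_matrices` is a hypothetical helper; it sorts algorithm names so the column order matches what a pandas pivot would give.

```python
# Hypothetical helper mirroring the pivot/groupby step without pandas.
def to_matrices(rows, constraint_cols=("constraint_compute", "constraint_memory")):
    periods = sorted({r["time_period"] for r in rows})
    algorithms = sorted({r["algorithm"] for r in rows})
    score = {(r["time_period"], r["algorithm"]): r["performance"] for r in rows}
    # The first row seen for each period supplies its constraints,
    # similar to groupby(...).first().
    first = {}
    for r in rows:
        first.setdefault(r["time_period"], r)
    # A missing (period, algorithm) pair raises KeyError here on purpose.
    performance = [[score[(t, a)] for a in algorithms] for t in periods]
    constraints = [[first[t][c] for c in constraint_cols] for t in periods]
    return performance, constraints, algorithms


rows = [
    {"time_period": 1, "algorithm": "ModelA", "performance": 0.85,
     "constraint_compute": 512, "constraint_memory": 8.0},
    {"time_period": 1, "algorithm": "ModelB", "performance": 0.78,
     "constraint_compute": 512, "constraint_memory": 8.0},
    {"time_period": 2, "algorithm": "ModelA", "performance": 0.87,
     "constraint_compute": 550, "constraint_memory": 8.5},
    {"time_period": 2, "algorithm": "ModelB", "performance": 0.80,
     "constraint_compute": 550, "constraint_memory": 8.5},
]
performance, constraints, algorithms = to_matrices(rows)
print(algorithms)
print(performance)
print(constraints)
```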

Advanced Usage: Bootstrap Confidence Intervals

NEW: Compute statistical significance with bootstrap resampling (n=1000):

from circular_bias_detector.core import (
    bootstrap_psi, 
    bootstrap_ccs, 
    bootstrap_rho_pc,
    compute_adaptive_thresholds
)

# Bootstrap confidence intervals and p-values
psi_results = bootstrap_psi(performance_matrix, n_bootstrap=1000)
ccs_results = bootstrap_ccs(constraint_matrix, n_bootstrap=1000)
rho_results = bootstrap_rho_pc(performance_matrix, constraint_matrix, n_bootstrap=1000)

# Display with confidence intervals
print(f"PSI = {psi_results['psi']:.4f} "
      f"[{psi_results['ci_lower']:.4f}-{psi_results['ci_upper']:.4f}], "
      f"p={psi_results['p_value']:.3f}")

print(f"CCS = {ccs_results['ccs']:.4f} "
      f"[{ccs_results['ci_lower']:.4f}-{ccs_results['ci_upper']:.4f}], "
      f"p={ccs_results['p_value']:.3f}")

print(f"ρ_PC = {rho_results['rho_pc']:+.4f} "
      f"[{rho_results['ci_lower']:+.4f}-{rho_results['ci_upper']:+.4f}], "
      f"p={rho_results['p_value']:.3f}")

# Compute data-adaptive thresholds (95th percentile)
adaptive = compute_adaptive_thresholds(
    performance_matrix, 
    constraint_matrix,
    quantile=0.95
)

print(f"\nAdaptive Thresholds:")
print(f"  PSI:  {adaptive['psi_threshold']:.4f}")
print(f"  CCS:  {adaptive['ccs_threshold']:.4f}")
print(f"  ρ_PC: {adaptive['rho_pc_threshold']:.4f}")

Example output:

PSI = 0.0238 [0.0113-0.0676], p=0.355
CCS = 0.8860 [0.8723-0.9530], p=0.342
ρ_PC = +0.9983 [+0.9972-+1.0000], p=0.772

Adaptive Thresholds:
  PSI:  0.0625
  CCS:  0.8860
  ρ_PC: 0.9983

See examples/bootstrap_example.py for a complete demonstration with LLM evaluation data.
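To show what a bootstrap confidence interval involves, here is a generic percentile bootstrap for a Pearson correlation, written with the standard library only. It is a sketch of the general technique, not the implementation behind `bootstrap_rho_pc`; the helper names and the toy data (compute budget versus mean score per period) are made up for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation, or None when either series is constant."""
    if len(set(xs)) < 2 or len(set(ys)) < 2:
        return None
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_ci(xs, ys, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the correlation of paired observations."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_bootstrap):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        r = pearson([xs[i] for i in idx], [ys[i] for i in idx])
        if r is not None:  # skip degenerate resamples
            stats.append(r)
    if not stats:
        raise ValueError("no usable bootstrap resamples")
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return pearson(xs, ys), lower, upper


# Illustrative data only: compute budget and mean score per time period.
compute = [512, 550, 600, 640, 700, 760]
mean_score = [0.815, 0.835, 0.875, 0.860, 0.900, 0.910]
point, lower, upper = bootstrap_ci(compute, mean_score)
print(f"r = {point:+.4f} [{lower:+.4f}, {upper:+.4f}]")
```

With only a handful of time periods the interval is typically wide, which is a reason to read point estimates with care.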

Enhanced API: Integrated Bootstrap and Adaptive Thresholds

NEW: Use BiasDetector with built-in bootstrap and adaptive thresholds:

# Assumed import path; adjust if BiasDetector is exposed elsewhere in the package
from circular_bias_detector import BiasDetector

detector = BiasDetector()

# Enable bootstrap confidence intervals
results = detector.detect_bias(
    performance_matrix=performance_matrix,
    constraint_matrix=constraint_matrix,
    algorithm_names=algorithms,
    enable_bootstrap=True,           # adds CIs and p-values
    n_bootstrap=1000,
    enable_adaptive_thresholds=True  # data-driven thresholds
)

# Results now include bootstrap statistics
print(f"PSI: {results['psi_score']:.4f} "
      f"[{results['psi_ci_lower']:.4f}-{results['psi_ci_upper']:.4f}], "
      f"p={results['psi_pvalue']:.3f}")

# Generate enhanced report
report = detector.generate_report(results)
print(report)  # Includes CI and significance stars

Data Validation and Auto-Cleaning

NEW: Automatically detect and fix data quality issues:

import pandas as pd

from circular_bias_detector.utils import (
    validate_and_clean_data,
    print_validation_report
)

# Load raw data
df = pd.read_csv('raw_data.csv')

# Validate and clean
df_clean, report = validate_and_clean_data(
    df,
    performance_cols=['performance'],  # metric column(s), not the algorithm label
    constraint_cols=['constraint_compute', 'constraint_memory'],
    time_col='time_period',
    algorithm_col='algorithm',
    auto_fix=True  # Automatically fix issues
)

# Print report
print_validation_report(report)
# Output:
# Data Quality Score: 85.0/100 ⚠️  GOOD
# Issues fixed:
#  - missing_values: forward_fill_then_mean
#  - outliers: IQR_clipping

Enhanced Visualizations

NEW: Generate publication-quality figures and interactive dashboards:

from circular_bias_detector.visualization import (
    plot_performance_heatmap,
    plot_constraint_heatmap,
    plot_interactive_dashboard,
    plot_correlation_matrix
)

# 1. Performance heatmap
plot_performance_heatmap(
    performance_matrix,
    algorithm_names=algorithms,
    save_path='performance_heatmap.png'
)

# 2. Interactive Plotly dashboard (with hover tooltips)
plot_interactive_dashboard(
    performance_matrix,
    constraint_matrix,
    results,
    algorithm_names=algorithms,
    save_html='dashboard.html'  # Open in browser
)

# 3. Correlation matrix
plot_correlation_matrix(
    performance_matrix,
    constraint_matrix,
    save_path='correlation.png'
)

See examples/visualization_example.py for complete code.

LLM Evaluation Example

Analyze bias in large language model benchmarking:

# Load LLM evaluation data
df = pd.read_csv('data/llm_eval_sample.csv')

# Include LLM-specific constraints
constraint_matrix = df.groupby('time_period')[[
    'constraint_compute',
    'constraint_memory', 
    'constraint_dataset_size',
    'max_tokens',           # LLM-specific
    'temperature'           # LLM-specific
]].first().values

# Detect if prompt engineering inflated scores
results = detector.detect_bias(performance_matrix, constraint_matrix)

# High ρ_PC suggests sampling parameters were tuned to improve scores
if abs(results['rho_pc_score']) > 0.5:
    print("⚠️  High correlation detected: sampling parameters may have been "
          "iteratively adjusted to inflate benchmark scores.")

📁 Repository Structure

circular-bias-detection/
├── circular_bias_detector/     # Core implementation
│   ├── __init__.py
│   ├── core.py                 # PSI, CCS, ρ_PC algorithms
│   ├── detection.py            # Main detection framework
│   └── utils.py                # Utility functions
├── circular_bias_cli/          # CLI tool
│   ├── main.py                 # CLI entry point
│   ├── adapters/               # Bridge to core library
│   └── utils/                  # Zenodo loader, etc.
├── web-app/                    # Web application
│   ├── src/                    # React components
│   └── public/                 # Static assets
├── examples/                   # Usage examples
│   ├── reproduce_simulations.py
│   ├── reproduce_case_studies.py
│   └── basic_usage_example.py
├── tests/                      # Test suite
│   └── test_basic.py
├── data/                       # Sample datasets
│   └── sample_data.csv
├── requirements.txt            # Dependencies
├── setup.py                    # Package installation
├── LICENSE                     # CC-BY-4.0 License
└── README.md                   # This file

💻 CLI Tool

Installation

# Install with CLI dependencies
pip install circular-bias-detector[cli]

# Or install from source
git clone https://github.com/hongping-zh/circular-bias-detection.git
cd circular-bias-detection
pip install -e .[cli]

Quick Start

# Analyze local CSV file
circular-bias detect data/sample_data.csv

# Use Zenodo dataset (automatic download & caching)
circular-bias detect zenodo://17201032

# Use CBD Dataset v3/v3.1 (auto-selects largest CSV)
circular-bias detect zenodo://17637303

# Specify algorithm and thresholds
circular-bias detect data.csv --algorithm psi --psi-threshold 0.2

# Export results as JSON
circular-bias detect data.csv --format json --output results.json

Available Commands

`detect` - Run bias detection

circular-bias detect <data-source> [options]

# Data sources:
#   - Local file: data/my_data.csv
#   - Zenodo: zenodo://17201032
#   - CBD Dataset v3/v3.1: zenodo://17637303
#   - Zenodo specific file: zenodo://17201032/scenario_high_bias.csv

# Options:
#   --algorithm {psi,ccs,rho_pc,decision}  Algorithm to run (default: decision)
#   --psi-threshold FLOAT                   PSI threshold (default: 0.15)
#   --ccs-threshold FLOAT                   CCS threshold (default: 0.85)
#   --rho-threshold FLOAT                   ρ_PC threshold (default: 0.5)
#   --format {text,json,csv}                Output format (default: text)
#   --output FILE                           Save results to file

Example:

circular-bias detect zenodo://17201032 \
    --algorithm decision \
    --psi-threshold 0.15 \
    --format json \
    --output results.json
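The data-source forms listed above (local path, `zenodo://<record>`, `zenodo://<record>/<file>`) can be told apart with the standard library. This is a sketch of how such strings might be parsed, not the CLI's actual loader; `parse_source` and the returned dict layout are hypothetical.

```python
from urllib.parse import urlsplit


# Hypothetical parser for the data-source strings accepted by `detect`.
def parse_source(source):
    parts = urlsplit(source)
    if parts.scheme != "zenodo":
        return {"kind": "local", "path": source}
    record_id = parts.netloc
    if not record_id.isdigit():
        raise ValueError(f"expected a numeric Zenodo record id, got {record_id!r}")
    filename = parts.path.lstrip("/") or None  # None means "no specific file"
    return {"kind": "zenodo", "record_id": record_id, "filename": filename}


for src in ("data/my_data.csv",
            "zenodo://17201032",
            "zenodo://17201032/scenario_high_bias.csv"):
    print(parse_source(src))
```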

`info` - Show dataset information

circular-bias info <source>

# Examples:
circular-bias info zenodo://17201032
circular-bias info data/sample_data.csv

`cache` - Manage cached data

# List cached datasets
circular-bias cache list

# Clear all cache
circular-bias cache clear

# Clear specific dataset
circular-bias cache clear --record-id 17201032

`list-algorithms` - Show available algorithms

circular-bias list-algorithms

CLI Output Example

============================================================
CIRCULAR BIAS DETECTION RESULTS
============================================================

PSI Score:  0.0158
CCS Score:  0.9422
ρ_PC Score: +0.9921

Overall Bias Detected: NO - Confidence: 33.3%

Interpretation:
No circular bias detected (confidence: 33.3%). 
Evaluation appears sound.

Details:
  algorithms_evaluated: ['ResNet', 'VGG', 'DenseNet', 'EfficientNet']
  time_periods: 5
  indicators_triggered: 1
============================================================

Data Format Requirements

CSV file must contain these columns:

Required columns:


Column	Type	Description
`time_period`	int	Evaluation round (1, 2, 3, ...)
`algorithm`	str	Model/algorithm name
`performance`	float	Metric score (0–1)
`constraint_compute`	float	Compute limit (e.g., FLOPs)
`constraint_memory`	float	Memory (GB)
`constraint_dataset_size`	int	Training set size (optional)

👉 See data/sample_data.csv for a working example.

🛠 Advanced Features (v1.1+)

Bootstrap CIs & p-values (n=1000) for statistical rigor
Adaptive thresholds via permutation testing
Auto data validation: fixes missing values, outliers, duplicates
Publication-ready visualizations: heatmaps, correlation plots, interactive dashboards
LLM-specific support: detects prompt/sampling tuning bias

detector = BiasDetector(enable_bootstrap=True, enable_adaptive_thresholds=True)
results = detector.detect_bias(performance_matrix, constraint_matrix)
print(detector.generate_report(results))

📚 Documentation & Examples

Offline Usage: docs/usage_offline.md
Real-time/Web: docs/usage_realtime.md
FAQ / Glossary / Contributing: FAQ.md, GLOSSARY.md, CONTRIBUTING.md
Full Examples: examples/
- basic_usage_example.py
- bootstrap_example.py
- llm_evaluation_example.py
- visualization_example.py

🧪 Testing & Reproduction

# Run tests
python -m pytest tests/

# Reproduce paper results
python examples/reproduce_simulations.py
python examples/reproduce_case_studies.py

📦 Installation

# Core library
pip install circular-bias-detector

# With CLI
pip install circular-bias-detector[cli]

# From source
git clone https://github.com/hongping-zh/circular-bias-detection.git
cd circular-bias-detection
pip install -e .

📈 Performance

Reported detection accuracy by domain:

93.2% in synthetic scenarios
89% in computer vision
87% in NLP benchmarks
91% in recommender systems

🤝 Contributing & Support


CBD Dataset (v3/v3.1) Citation

Please cite the dataset alongside the software when you use the CBD data in research or products.

Concept DOI (all versions): https://doi.org/10.5281/zenodo.17637302
Version DOI (this release v3/v3.1): https://doi.org/10.5281/zenodo.17637303

APA (Dataset, version DOI):

Hongping Zhang, & CBD Project Team. (2025). Circular Bias Detection (CBD) dataset and evaluation protocols (v3 / v3.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17637303

Acknowledgments

This project was developed to address a critical gap in AI evaluation integrity. We thank:

The open-source community for foundational libraries (NumPy, SciPy, Pandas)
Early adopters and users who provided valuable feedback
Zenodo for free dataset archiving and DOI assignment

🤝 Contributing

We welcome contributions! Here's how you can help:

Reporting Issues:

Use the GitHub issue tracker
Provide a clear description and reproducible example
Include system information and error messages

Contributing Code:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes and add tests
Ensure all tests pass (python -m pytest tests/)
Commit your changes (git commit -m 'Add amazing feature')
Push to your branch (git push origin feature/amazing-feature)
Open a Pull Request

Seeking Support:

Check existing issues and documentation
Open a new issue with the question label
Email: yujjam@uest.edu.gr
Report bugs: GitHub Issues
Ask questions: Discussions or email yujjam@uest.edu.gr
Contribute: Fork → Branch → Test → PR (see CONTRIBUTING.md)


📄 License

Code: MIT License
Docs & Dataset: CC BY 4.0
✅ Free for academic and commercial use with attribution

📖 Citation

If you use Sleuth, please cite:

@software{zhang2024sleuth,
  author    = {Zhang, Hongping},
  title     = {Sleuth: Circular Bias Detection for AI Evaluations},
  year      = {2024},
  version   = {v1.0.0},
  doi       = {10.5281/zenodo.17201032},
  url       = {https://github.com/hongping-zh/circular-bias-detection}
}

Dataset citation (if used):
DOI: 10.5281/zenodo.17196639

👤 Author

Hongping Zhang
📧 yujjam@uest.edu.gr | ORCID

🚀 Ready to Detect Bias?

Try Web App • Star on GitHub • Read Docs

Empowering researchers worldwide to ensure AI evaluation integrity
