akshan-main/equitas-benchmark

Equitas

Corruption-Robust Aggregation for Multi-LLM Governance Committees

A benchmark for evaluating aggregation strategies in hierarchical multi-LLM committees under adversarial corruption.

Quick Start

pip install equitas-benchmark          # from PyPI
# or for local development:
pip install -e .
python -m equitas --config configs/governance_sweep_fh.yaml

Aggregation Methods (8 baselines + oracle)

| Method | Key Idea |
| --- | --- |
| Oracle | Hindsight-optimal action (upper bound) |
| Multiplicative Weights | `w *= exp(-eta * loss)`, adapts to corruption |
| Supervisor Rerank | Follow-the-leader: re-rank by best recent agent |
| Confidence-Weighted | Weight by self-reported confidence |
| EMA Trust | Exponential moving average of past performance |
| Trimmed Vote | Drop outlier agents, then majority |
| Majority Vote | Equal-weight plurality |
| Oracle Upper Bound | Best single agent in hindsight |
| Random Dictator | Uniformly random agent each round |
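
The Multiplicative Weights update shown in the table is the core corruption-robust mechanism: agents that accumulate loss are down-weighted exponentially fast. A minimal self-contained sketch (function and variable names here are illustrative, not the package's actual API):

```python
import math

def mw_update(weights, losses, eta=0.5):
    """One multiplicative-weights step: w_i *= exp(-eta * loss_i), renormalized.

    High-loss agents (e.g. corrupted members) lose influence exponentially,
    which is what makes this aggregator robust to corruption.
    """
    new = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(new)
    return [w / total for w in new]

# Three agents with equal initial weight; agent 2 is corrupted
# (hypothetical losses: it incurs loss 1.0 every round, the others 0.1).
w = [1 / 3] * 3
for _ in range(10):
    w = mw_update(w, [0.1, 0.1, 1.0])
# The corrupted agent's weight decays toward zero while the honest
# agents split the remaining mass evenly.
```

The renormalization keeps the weights a probability distribution, so the aggregator can use them directly to mix or sample committee votes.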

Experiments

  1. Corruption sweep: rate x adversary type x aggregator
  2. Pareto sweep: welfare-fairness tradeoff via (alpha, beta)
  3. Recovery: mid-run corruption onset, track MW weight recovery
  4. Scaling: committee size N in {3, 5, 7, 10}
  5. Hierarchical vs flat: architecture comparison
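
For experiment 1, the sweep crosses three axes (corruption rate, adversary type, aggregator). A config for it might look like the sketch below; the field names are hypothetical, so consult the YAML files in configs/ for the real schema:

```yaml
# Hypothetical sketch only; field names are illustrative, not the actual schema.
experiment: corruption_sweep
corruption_rates: [0.0, 0.1, 0.2, 0.3]
adversaries: [selfish, coordinated, scheduled, deceptive]
aggregators: [multiplicative_weights, trimmed_vote, majority_vote]
seeds: 5
```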

Reproducibility

Raw experiment outputs in outputs/ include historical runs with all methods tested during development (including self_consistency). The reported benchmark results exclude self_consistency at the analysis layer: table-generation scripts (scripts/generate_benchmark_tables.py, scripts/generate_go_vs_fh_tables.py) and figure-generation (regenerate_figures.py) filter it out on read. The self_consistency aggregator is also hard-disabled in the codebase (equitas/config.py raises ValueError if used) because it implements a committee-level subsampled majority vote, not canonical within-agent self-consistency sampling. See the future-work discussion in the paper.
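
Filtering at the analysis layer means the raw outputs stay untouched and the exclusion happens on read. The idea can be sketched as follows (the column names and values here are hypothetical; the real outputs/ schema may differ):

```python
import pandas as pd

# Hypothetical raw results covering all methods run during development.
raw = pd.DataFrame({
    "method": ["majority_vote", "self_consistency", "multiplicative_weights"],
    "welfare": [0.71, 0.64, 0.83],
})

# Filter-on-read, in the spirit of the table/figure scripts:
# drop self_consistency rows before any aggregation or plotting.
reported = raw[raw["method"] != "self_consistency"]
```

Keeping the filter in the scripts (rather than deleting rows from outputs/) preserves the full development record while keeping the reported benchmark consistent.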

To regenerate all artifacts from raw data:

python scripts/generate_benchmark_tables.py   # tables/benchmark/
python scripts/generate_go_vs_fh_tables.py    # tables/
python regenerate_figures.py                  # paper/figures/
python -m pytest tests/ -q                    # 88 tests

Project Structure

equitas/          # pip-installable package
  agents/         # LLM client, member/leader/judge/governor agents
  aggregators/    # 8 aggregation strategies (registry pattern)
  adversaries/    # 4 adversary types (selfish, coordinated, scheduled, deceptive)
  metrics/        # fairness, welfare, Pareto, robust statistics
  simulation/     # hierarchical + flat engine
  experiments/    # sweep, recovery, scaling, pareto, hier-vs-flat
  plotting/       # paper-quality matplotlib figures
configs/          # YAML experiment configs
scripts/          # table generation, analysis
paper/            # LaTeX source + figures
tests/            # 88 unit + integration tests
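
The registry pattern noted for aggregators/ typically maps a config string to an implementation class. A minimal sketch of how such a registry is commonly built (names are illustrative, not the Equitas API):

```python
# Registry mapping aggregator names (as used in YAML configs) to classes.
AGGREGATORS = {}

def register(name):
    """Class decorator that records an aggregator under a config key."""
    def decorator(cls):
        AGGREGATORS[name] = cls
        return cls
    return decorator

@register("majority_vote")
class MajorityVote:
    def aggregate(self, votes):
        # Equal-weight plurality over the committee's votes.
        return max(set(votes), key=votes.count)

# A config string is enough to instantiate the right strategy:
agg = AGGREGATORS["majority_vote"]()
print(agg.aggregate(["A", "B", "A"]))  # prints "A"
```

This keeps experiment configs declarative: adding a new strategy means registering one class, with no changes to the simulation engine.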
