failsafesecurity/failsafe-swarm-bench


Why This Exists

The current state of AI security benchmarks

The AI boom has turned "AI-powered code security" into one of the noisiest categories in software. Every week brings another tool claiming superhuman vulnerability detection, backed by self-reported benchmarks on private datasets.

We're publishing everything: the methodology, the full pipeline artifacts, the raw results, and real vulnerability disclosures accepted by production projects including NVIDIA, NEAR, FFmpeg, and OpenBSD. These aren't benchmark scores on synthetic bugs. They're confirmed security findings that required coordinated disclosure.

Don't trust our words. Trust the outcome.


Introducing SWARM

SWARM (Systemic Weakness Analysis and Remediation Model) maps out threat models, system architectures, invariants, and trust boundaries using multiple specialised frontier models, then uses harnessed tooling and artifacts to guide autonomous red-team agents toward exploit validation in an isolated environment. The same methodology applies to any codebase with security-critical logic: smart contracts, AI agent frameworks, and web and mobile applications.

Vulnerability Disclosures

SWARM has found and disclosed vulnerabilities in production codebases.

| Project | Vulnerability |
| --- | --- |
| NVIDIA NemoClaw | Path traversal via unsanitized --run-id in rollback/status actions, enabling arbitrary file read/write outside the state directory (PR) |
| | Prototype pollution via unsanitized config path in snapshot migration, allowing arbitrary property injection into Object.prototype (PR) |
| | Incomplete SSRF blocklist missing IANA-reserved IP ranges (0.0.0.0/8, 198.18.0.0/15), allowing bypass to reach internal infrastructure (PR) |
| NEAR AI Ironclaw | Safety layer bypass via output truncation: oversized tool output skipped leak detection, policy enforcement, and injection scanning (PR) |
| | Indirect prompt injection via memory poisoning (PR) |
| | Zip bomb denial of service in document extraction (PR) |
| | SSRF via extension download and MCP transport redirects (PR) |
| Hermes Agent | Arbitrary file read through unvalidated MEDIA:<path> tags, exploitable via prompt injection to exfiltrate sensitive files (PR) |
| | Missing Twilio webhook signature validation, allowing forged requests to bypass the SMS allowlist and impersonate authorized users (PR) |
| Balancer ReClAMM | Mathematical edge case in virtual balance rounding that could cause underflow in extreme market conditions |
| Euler Finance | Vulnerabilities identified in the Euler Lite codebase |
| Consensys Web3Auth | Insecure PRNG used for authentication nonce in WalletConnectV2Connector (PR) |
| | Open redirect via WalletConnect peer metadata (PR) |
| Jitsi | Cryptographic weakness: hardcoded salt and low iteration count in AESCrypto.java (PR) |
| | Missing-braces logic error leading to UI denial of service (PR) |
| | Business logic flaw: TOCTOU bypass in OTR fingerprint verification (PR) |
| OKX Wallet Core | Use of abi.encodePacked for EIP-712 array hashing (PR) |
| | Missing deadline field in CALLS_TYPEHASH for the validator execution path (PR) |
| | Non-standard EIP-712 two-part digest in the EIP-1271 validator path (PR) |
| Vercel | Arbitrary code execution via path traversal in the x-matched-path header |
| Supabase MCP | Missing maximum operation limits: unbounded file array and content size in deployEdgeFunction |
| FFmpeg | CVE-level vulnerabilities identified via agentic penetration testing (full report) |
| OpenBSD | Vulnerabilities identified in OpenBSD's slaacd daemon via agentic penetration testing (full report) |

Also see how we compare against Claude Mythos (using Gemini 3 Flash) here.

Benchmark

To make our results reproducible, we evaluated SWARM against EVMBench, an open-source benchmark of 120 confirmed HIGH-severity vulnerabilities across 40 audit contests. Anyone can run the same evaluation against the same codebases.

| Approach | Detected | Recall |
| --- | --- | --- |
| FailSafe SWARM | 83 / 120 | 69.2% |
| Claude Opus 4.6 (single agent) | ~55 / 120 | 45.6% |
| GPT-5.2 (single agent) | ~26 / 120 | ~22% |

  • 22 / 40 contests with perfect detection
  • All 40 contests completed within the 3-hour time limit

Beyond HIGH Severity

The benchmark tests only HIGH-severity findings, but the original audit contests also produced MEDIUM-severity findings (typically 10-26 per contest). Because SWARM produces full threat models rather than isolated bug reports, its confirmed findings cover this territory too.

To illustrate, we cross-referenced SWARM's output against the complete set of confirmed findings from the original Curves Code4rena contest.

Curves: 9 of 14 confirmed contest vulnerabilities detected. The contest produced 4 HIGHs and 10 MEDIUMs. SWARM detected 3 of 4 HIGHs and independently identified 6 of 10 MEDIUMs, hitting 64% total recall across all severities.

| ID | Contest Finding | SWARM Finding |
| --- | --- | --- |
| H | (3 of 4 HIGHs detected) | |
| M-01 | Protocol fee permanently locked on sells | Protocol Fee Permanently Locked on Sells |
| M-03 | Lack of slippage protection in buy/sell | Missing Slippage Protection in buy() and sell() |
| M-05 | Anyone can set referral fee for any address | Referral Fee Manipulation via setReferralFeeDestination |
| M-07 | Wrapping all tokens causes permanent DoS | DoS on All Trading by Wrapping All Tokens to ERC20 |
| M-09 | Excess ETH from buy overpayment locked | Excess ETH from Buy Overpayment Permanently Locked |
| M-10 | onBalanceChange exploitable for fee theft | Weaponized onBalanceChange Wipes Victim's Unclaimed Fees |

Methodology

SWARM's core insight is that structured threat modeling provides better coverage than free-form code review. The pipeline builds a layered threat model through four phases, then uses those artifacts to guide autonomous deep-dive agents.

```mermaid
flowchart TD
    A["Phase A - Foundation<br/>5 specialist analyses"]
    B["Phase B - Threat Generation<br/>6 specialists × 2 LLM passes"]
    C["Phase C - Deduplication<br/>Semantic consolidation"]
    D["Phase D - Validation<br/>CONFIRMED / REFUTED"]
    E["Phase E - Agentic Deep Dive<br/>Claude Opus + Codex 5.3"]

    A -->|"invariants &<br/>architecture"| B
    B -->|"50-80<br/>hypotheses"| C
    C -->|"unique threats"| D
    D -->|"full threat model +<br/>confirmed findings"| E

    style A fill:#4a9eff,color:#fff,stroke:#2563eb
    style B fill:#4a9eff,color:#fff,stroke:#2563eb
    style C fill:#7c3aed,color:#fff,stroke:#5b21b6
    style D fill:#059669,color:#fff,stroke:#047857
    style E fill:#dc2626,color:#fff,stroke:#b91c1c
```

Phase A - Foundation Analysis

Five specialist LLMs analyze the codebase in parallel, each from a different perspective:

| Specialist | Focus |
| --- | --- |
| Architecture & Entry Points | Asset inventory, system structure, public interfaces |
| Security & Trust Boundaries | Trust zones, state transitions, vulnerability surface |
| Data Flow & Logic | Data propagation paths, business logic edge cases |
| State Machine Invariants | Lifecycle rules, monotonicity, access control invariants |
| Economic Invariants | Conservation laws, solvency rules, yield consistency |

Phase A establishes structural understanding: invariants, trust boundaries, and entry points. No attack hypotheses are generated here. This phase produces the context that downstream phases build on.
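
As a rough sketch of what this fan-out could look like in code (the pipeline itself is not published; callModel, the prompt paths, and the return shape below are illustrative assumptions, not the actual implementation):

```js
// Hypothetical sketch of the Phase A fan-out: five specialist analyses in parallel.
import fs from "node:fs/promises";

const SPECIALISTS = [
  "architecture-entry-points",
  "security-trust-boundaries",
  "data-flow-logic",
  "state-machine-invariants",
  "economic-invariants",
];

async function callModel(systemPrompt, codebaseDir) {
  // Placeholder: call your provider of choice (Claude/GPT/Gemini) here and
  // return its structured analysis. Stubbed so the sketch runs end to end.
  return { prompt: systemPrompt.slice(0, 40), codebaseDir, observations: [] };
}

export async function runPhaseA(codebaseDir) {
  const analyses = await Promise.all(
    SPECIALISTS.map(async (name) => {
      // Hypothetical prompt paths; the repo only publishes the Phase E prompt.
      const prompt = await fs.readFile(`prompts/phase-a-${name}.txt`, "utf8");
      return [name, await callModel(prompt, codebaseDir)];
    })
  );
  // Downstream phases consume these artifacts as shared context.
  return Object.fromEntries(analyses);
}
```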

Phase B - Threat Hypothesis Generation

Six specialists generate concrete attack hypotheses informed by Phase A's analysis. Each specialist runs two passes with different LLMs to maximize coverage through model diversity:

| Specialist | Pass 1 | Pass 2 |
| --- | --- | --- |
| Technical Threats | LLM-A | LLM-B |
| Economic Threats | LLM-A | LLM-C |
| Operational Threats | LLM-A | LLM-B |

Every hypothesis must be code-anchored: exact file, line numbers, and the specific pattern that triggered it. Typical output: 50-80 hypotheses per codebase.
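
For illustration, a code-anchored hypothesis could be represented roughly like this (a hypothetical schema; the real artifact format is what ships in swarm-outputs/):

```js
// Hypothetical shape of a code-anchored hypothesis; see swarm-outputs/ for real artifacts.
const hypothesis = {
  id: "TECH-014",
  specialist: "technical-threats",
  title: "Reentrancy in fee claim before state update",
  file: "contracts/Curves.sol",   // exact file
  lines: [214, 231],              // exact line range
  pattern: "external call precedes zeroing of unclaimed fees",
  invariantViolated: "fees can only be claimed once",
};

// A hypothesis without a concrete code anchor is not allowed to proceed.
function isCodeAnchored(h) {
  return Boolean(h.file && Array.isArray(h.lines) && h.lines.length && h.pattern);
}

console.log(isCodeAnchored(hypothesis)); // true
```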

Phase C - Semantic Deduplication

Multiple specialists often flag the same vulnerability from different angles. A "reentrancy" finding from the technical specialist and a "flash loan manipulation" finding from the economic specialist may target the same state change. Phase C consolidates semantic duplicates while preserving distinct findings. Typical reduction: ~45%.
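
A minimal sketch of this consolidation step, assuming an embedding-based similarity check (the embed function and the 0.88 threshold are assumptions for illustration, not the pipeline's actual mechanism):

```js
// Illustrative semantic deduplication: keep one representative per cluster of
// hypotheses whose embeddings are sufficiently close.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function dedupe(hypotheses, embed, threshold = 0.88) {
  const kept = [];
  for (const h of hypotheses) {
    h.vector ??= await embed(`${h.title} ${h.file} ${h.pattern}`);
    const dup = kept.find((k) => cosine(k.vector, h.vector) >= threshold);
    if (dup) dup.duplicates.push(h.id);       // same defect seen from another angle
    else kept.push({ ...h, duplicates: [] }); // distinct finding survives
  }
  return kept;
}
```

Usage would be `await dedupe(hypotheses, embedFn)`, where embedFn wraps whatever embedding API is available.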

Phase D - Validation

Each deduplicated hypothesis is validated independently through deep code analysis:

  1. Verify the proof-of-signal exists in the actual code
  2. Trace the complete execution path from entry point to vulnerability
  3. Confirm all preconditions are achievable
  4. If config-dependent, validate against deployment scripts

Each hypothesis receives a verdict: CONFIRMED, REFUTED, or CONTESTED (when validators disagree). No hypothesis is confirmed without citing the specific code that proves the defect.
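
One plausible way to derive that per-hypothesis verdict from independent validator outputs (the unanimity rule below is an illustrative assumption; the citation requirement mirrors the rule just stated):

```js
// Sketch of verdict aggregation across independent validators.
function aggregateVerdict(validations) {
  const confirmed = validations.filter((v) => v.verdict === "CONFIRMED");
  const refuted = validations.filter((v) => v.verdict === "REFUTED");

  if (confirmed.length && refuted.length) return "CONTESTED"; // validators disagree
  if (!confirmed.length) return "REFUTED";

  // No hypothesis is confirmed without citing the code that proves the defect.
  return confirmed.every((v) => v.citedCode?.length) ? "CONFIRMED" : "CONTESTED";
}

console.log(aggregateVerdict([
  { verdict: "CONFIRMED", citedCode: ["Curves.sol:214-231"] },
  { verdict: "CONFIRMED", citedCode: ["Curves.sol:214-231"] },
])); // CONFIRMED
```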

Phase E - Guided Agentic Deep Dive

Phases A-D produce the majority of detections. Phase E supplements them with autonomous agents (Claude Opus 4.6 and Codex 5.3) that run independent deep dives into the codebase. These agents receive SWARM's full threat model as context: the architecture, invariants, trust boundaries, confirmed findings, and refuted hypotheses from Phases A-D. This lets them build on what the pipeline has already established and focus on areas with known gaps: integration boundaries, mathematical edge cases, and multi-step attack chains.

Phase E contributed 8 additional detections across the 40 benchmark contests.
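
A sketch of how an agent's context could be assembled from the published artifacts (the directory layout matches swarm-outputs/ as documented below; exactly what phase-e-agent.js reads may differ):

```js
// Illustrative assembly of a Phase E agent context from SWARM artifacts.
import fs from "node:fs/promises";
import path from "node:path";

async function readJson(file) {
  return JSON.parse(await fs.readFile(file, "utf8"));
}

export async function buildPhaseEContext(contestDir) {
  const entries = await fs.readdir(contestDir);
  const phaseA = entries.filter((f) => f.startsWith("phase-a-") && f.endsWith(".json"));

  const confirmedDir = path.join(contestDir, "phase-d", "confirmed");
  const confirmedFiles = (await fs.readdir(confirmedDir)).filter((f) => f.endsWith(".json"));

  return {
    // Structural understanding from Phase A steers where the agent digs.
    architecture: await Promise.all(phaseA.map((f) => readJson(path.join(contestDir, f)))),
    // Confirmed findings tell the agent what is already established.
    confirmed: await Promise.all(confirmedFiles.map((f) => readJson(path.join(confirmedDir, f)))),
  };
}
```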

Multi-Model Diversity

SWARM uses multiple LLM providers (Claude, GPT, Gemini) across all phases. Different models surface different classes of vulnerabilities; the heterogeneous ensemble provides broader coverage than any single model.
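
Sketched below, a specialist pass fanned out across providers per the Phase B pass table (the provider stubs stand in for real API clients; nothing here is the pipeline's actual dispatch code):

```js
// Heterogeneous-ensemble dispatch: the same specialist prompt goes to different
// providers so model-specific blind spots don't overlap. Stubs return empty
// hypothesis lists so the sketch runs as-is.
const PROVIDERS = {
  "LLM-A": async (prompt) => [], // e.g. an Anthropic client call
  "LLM-B": async (prompt) => [], // e.g. an OpenAI client call
  "LLM-C": async (prompt) => [], // e.g. a Gemini client call
};

async function runSpecialistPasses(prompt, passes /* e.g. ["LLM-A", "LLM-B"] */) {
  const results = await Promise.all(passes.map((p) => PROVIDERS[p](prompt)));
  // The union of both passes feeds Phase C, which removes semantic duplicates.
  return results.flat();
}
```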

Known Limitations

Integration Boundary Bugs

The primary miss pattern involves vulnerabilities at the boundary between audited code and external protocols, e.g. Pendle's skim() behavior, Balancer's getActualSupply vs totalSupply, or Morpho Blue decimal normalization. These require knowledge of external protocol interfaces that isn't present in the audited codebase.

In controlled experiments, providing integration documentation for external protocols increased detection from 10/20 to 15/20 on the noya contest (+50%). We did not include integration documentation in our benchmark submission to maintain parity with other approaches that operate on code alone. In production deployments, users supply third-party protocol documentation, which improves detection of integration boundary bugs.

Judge Variance

The GPT-5 LLM judge exhibits ±2-3% variance across grading runs on borderline cases. All results reported here are from a single consistent grading session.

Per-Contest Breakdown

| # | Contest | Vulnerabilities | Detected | Recall % |
| --- | --- | --- | --- | --- |
| 1 | noya | 20 | 12 | 60 |
| 2 | benddao | 7 | 5 | 71 |
| 3 | renft | 6 | 3 | 50 |
| 4 | phi | 6 | 4 | 67 |
| 5 | taiko | 5 | 3 | 60 |
| 6 | forte | 5 | 3 | 60 |
| 7 | munchables-07 | 5 | 5 | 100 |
| 8 | abracadabra | 4 | 2 | 50 |
| 9 | curves | 4 | 3 | 75 |
| 10 | virtuals | 4 | 4 | 100 |
| 11 | size | 4 | 2 | 50 |
| 12 | init-capital | 3 | 1 | 33 |
| 13 | secondswap | 3 | 3 | 100 |
| 14 | tempo-mpp | 3 | 1 | 33 |
| 15 | tempo-stablecoin | 3 | 3 | 100 |
| 16 | canto-03 | 2 | 2 | 100 |
| 17 | ethereumcreditguild | 2 | 2 | 100 |
| 18 | pooltogether | 2 | 2 | 100 |
| 19 | traitforge | 2 | 1 | 50 |
| 20 | vultisig | 2 | 2 | 100 |
| 21 | panoptic | 2 | 2 | 100 |
| 22 | sequence | 2 | 0 | 0 |
| 23 | thorchain | 2 | 0 | 0 |
| 24 | canto-01 | 2 | 2 | 100 |
| 25 | nextgen | 2 | 2 | 100 |
| 26 | olas | 2 | 1 | 50 |
| 27 | basin | 2 | 2 | 100 |
| 28 | munchables-05 | 2 | 2 | 100 |
| 29 | althea | 1 | 1 | 100 |
| 30 | arbitrum-foundation | 1 | 1 | 100 |
| 31 | coinbase | 1 | 0 | 0 |
| 32 | wildcat | 1 | 0 | 0 |
| 33 | neobase | 1 | 1 | 100 |
| 34 | loop | 1 | 1 | 100 |
| 35 | gitcoin | 1 | 1 | 100 |
| 36 | liquid-ron | 1 | 1 | 100 |
| 37 | next-generation | 1 | 1 | 100 |
| 38 | thorwallet | 1 | 1 | 100 |
| 39 | blackhole | 1 | 0 | 0 |
| 40 | tempo-feeamm | 1 | 1 | 100 |
| | TOTAL | 120 | 83 | 69.2 |

Artifacts

This repository includes full artifacts for all 40 contests. Each directory has its own README with detailed documentation.

| Directory | Contents | Start Here |
| --- | --- | --- |
| results/ | Judge inputs and outputs (40 contests) | audit-graded-all-combined.json - the grading verdict for each contest |
| swarm-outputs/ | Full SWARM threat models (Phases A-D, ~4,750 files) | phase-d/confirmed/ - validated findings with root cause and code paths |
| scripts/ | Phase E runners, grading, and aggregation scripts | phase-e-agent.js - the Claude Phase E autonomous agent |
| prompts/ | Phase E prompt template | phase-e-prompt.txt |

Quick Start: Exploring a Contest

To examine SWARM's full analysis of a specific contest (e.g., Curves), start with these files; a short Node sketch for browsing the confirmed findings follows the list:

  1. Grading results - results/per-contest/2024-01-curves/audit-graded-all-combined.json
  2. Confirmed findings - swarm-outputs/2024-01-curves/phase-d/confirmed/*.json
  3. Threat model context - swarm-outputs/2024-01-curves/phase-a-*.json
  4. Raw submission - results/per-contest/2024-01-curves/audit.json
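
For example, a few lines of Node (saved as an .mjs file so top-level await works) can enumerate the confirmed Phase D findings without assuming anything about their internal field names:

```js
// explore.mjs - list SWARM's confirmed Phase D findings for the Curves contest.
// The directory path follows this repo's layout; only the top-level keys of
// each JSON artifact are printed, so no field names are assumed.
import fs from "node:fs/promises";
import path from "node:path";

const dir = "swarm-outputs/2024-01-curves/phase-d/confirmed";
for (const file of await fs.readdir(dir)) {
  if (!file.endsWith(".json")) continue;
  const finding = JSON.parse(await fs.readFile(path.join(dir, file), "utf8"));
  console.log(file, "->", Object.keys(finding).join(", "));
}
```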

Reproducibility

  • Phase E: Requires a Claude API key (phase-e-agent.js) and/or an OpenAI API key (phase-e-codex.mjs). Run against any contest codebase with SWARM artifacts as input.
  • Grading: Requires an OpenAI API key (GPT-5 judge). Run grade-detect.js against ground truth.
  • SWARM pipeline (Phases A-D): The pipeline scripts and prompts are not included. SWARM outputs for all 40 contests are provided in swarm-outputs/.

Built by the FailSafe team.
