Autofix Bot Bench

This repository contains prompts used for benchmarking security analysis and remediation by Autofix Bot and other security review tools.

benchmarks/security-analysis/prompts/
- openai-codex-new-security-prompt.md — Security review prompt used in OpenAI Codex CLI tool to identify vulnerabilities.
- autofix-eval-judge-prompt.md — Judge prompt to evaluate whether a diff fully fixes a vulnerability.
- codex-autofix-prompt.md — Security fix generation instruction used in Codex CLI for a detected vulnerability.
- claude-code-original-security-prompt.md — Original security review prompt used by Claude Code CLI.
- claude-code-modified-security-prompt.md — File-based security review prompt detecting intentional vulnerabilities for benchmarking.
- claude-code-autofix-prompt.md — Security fix generation prompt used in Claude Code CLI to fix a given vulnerability.
- gemini-cli-autofix-prompt.md — Security fix generation prompt used in Gemini CLI to fix a vulnerability.

Notes

For Claude Code, the original security prompt from the /security-review command was modified so that large codebases could be analyzed through batched file processing. The original prompt was git-centric and designed for general security reviews: it analyzed entire git diffs, which exceeded token limits on large datasets. The modified version is run with the CLI command claude -p /security-review --permission-mode acceptEdits and processes 10 files per CLI instance, because Claude struggled with larger file counts while single-file analysis proved too memory intensive. There are three key modifications. First, git diff analysis was replaced with file-list based analysis to stay within token limits. Second, Claude is explicitly instructed to report intentional vulnerabilities in benchmark repositories, which it previously dismissed as not real. Third, the output format changed from plain text on stdout to a structured JSONL file, so that results can be collected systematically for benchmarking.
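The batching and JSONL collection described above can be sketched roughly as follows. This is an illustrative sketch, not the harness used for the benchmark: the file paths, the helper names make_batches and parse_jsonl, and the fields in the sample findings are all hypothetical, and how the file list reaches the prompt is not specified in these notes. Only the batch size of 10 and the CLI command come from the text.

```python
import json
from typing import Iterable, Iterator

BATCH_SIZE = 10  # files per CLI instance, as stated in the notes

# The command quoted in the notes, split into argv form (not executed here).
CLAUDE_CMD = ["claude", "-p", "/security-review", "--permission-mode", "acceptEdits"]


def make_batches(files: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` files."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(files), size):
        yield files[start:start + size]


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL text: one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


# Hypothetical file list: 23 files split into batches of 10, 10 and 3.
files = [f"src/File{i:02d}.java" for i in range(23)]
batches = list(make_batches(files))
print([len(b) for b in batches])  # [10, 10, 3]

# Hypothetical findings; the real JSONL schema is defined by the modified prompt.
sample = '{"file": "src/File00.java", "cwe": 89}\n\n{"file": "src/File07.java", "cwe": 79}\n'
findings = parse_jsonl(sample.splitlines())
print(len(findings))  # 2
```

Each batch would be handed to its own CLI instance, and the JSONL files written by the instances can then be concatenated and parsed in one pass.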
Codex lacks a dedicated security review feature, so a custom security prompt was created by passing Claude Code's security review prompt through OpenAI's official prompt generator. The CLI command codex exec --sandbox workspace-write < {prompt} processes 15 files at a time in parallel, which shows that Codex handles larger batches than the 10 files used with Claude Code.
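One way to read "15 files at a time in parallel" is batches of 15 files, with several batches running concurrently. The sketch below follows that reading and should be treated as an assumption rather than the benchmark's actual driver. The {files} placeholder, the worker count, and the helper names are hypothetical; the command and the use of stdin for the prompt mirror the command quoted above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

BATCH_SIZE = 15  # files per Codex instance, as stated in the notes
CODEX_CMD = ["codex", "exec", "--sandbox", "workspace-write"]


def render_prompt(template: str, batch: list[str]) -> str:
    """Fill the hypothetical {files} placeholder with one path per line."""
    return template.replace("{files}", "\n".join(batch))


def run_codex(prompt: str) -> int:
    """Run one Codex instance with the prompt on stdin; return its exit code."""
    return subprocess.run(CODEX_CMD, input=prompt, text=True).returncode


def run_batches(
    files: list[str],
    template: str,
    runner: Callable[[str], int] = run_codex,
    workers: int = 4,  # assumed concurrency; the notes give no number
) -> list[int]:
    """Split files into batches and run them concurrently, keeping batch order."""
    batches = [files[i:i + BATCH_SIZE] for i in range(0, len(files), BATCH_SIZE)]
    prompts = [render_prompt(template, b) for b in batches]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, prompts))


# Dry run with a stand-in runner that counts the file lines in each prompt,
# so nothing is executed. 40 hypothetical files give batches of 15, 15 and 10.
files = [f"src/File{i:02d}.java" for i in range(40)]
counts = run_batches(
    files,
    "Review these files:\n{files}",
    runner=lambda prompt: len(prompt.splitlines()) - 1,
)
print(counts)  # [15, 15, 10]
```

Passing the runner as a parameter keeps the batching logic testable without invoking the CLI; the default runner is the only part that shells out.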
Refer to the Autofix Bot benchmarks page for more information.
Detected security issues are validated against the OWASP Benchmark Java repository, which serves as ground truth.
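As a rough illustration of validating detections against ground truth, the sketch below compares a set of flagged test cases with expected labels and reports precision and recall. The test case names and labels are made-up sample data, not entries from the OWASP Benchmark, and the choice of metrics is an assumption: these notes do not say which scores the benchmark reports.

```python
def score(expected: dict[str, bool], flagged: set[str]) -> dict[str, float]:
    """Compare flagged test cases with ground truth labels.

    expected maps a test case name to True if it contains a real vulnerability.
    flagged is the set of test case names the tool reported.
    """
    tp = sum(1 for name, real in expected.items() if real and name in flagged)
    fn = sum(1 for name, real in expected.items() if real and name not in flagged)
    fp = sum(1 for name, real in expected.items() if not real and name in flagged)
    tn = sum(1 for name, real in expected.items() if not real and name not in flagged)
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "tn": tn,
        # Guard against division by zero when nothing was flagged or nothing is real.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical ground truth and detections, for illustration only.
expected = {
    "BenchmarkTest00001": True,
    "BenchmarkTest00002": False,
    "BenchmarkTest00003": True,
    "BenchmarkTest00004": False,
}
flagged = {"BenchmarkTest00001", "BenchmarkTest00002"}

result = score(expected, flagged)
print(result["precision"], result["recall"])  # 0.5 0.5
```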

DeepSourceCorp/autofix-bot-bench
