If you discover a security vulnerability in EVA-Bench, please report it privately by emailing the maintainers rather than opening a public issue.
EVA-Bench uses Python's `eval()` in two locations to evaluate condition expressions:
- `src/eva_bench/scorer/sentinels.py`: `_eval_logic_condition()` evaluates sentinel trigger conditions.
- `src/eva_bench/simulator/engine.py`: `_eval_trigger_condition()` evaluates contingency injection triggers.
**Trust model:** All condition strings are authored by benchmark maintainers in committed task JSON files. They are not derived from model output, user input, or any external source. The `eval()` calls:
- Clear `__builtins__` to prevent access to Python built-ins.
- Inject only safe helper functions (comparisons, state lookups).
- Are bounded to simple boolean expressions.
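The restricted-eval pattern above can be sketched as follows. This is an illustrative example, not the project's actual code; the helper names (`make_safe_env`, `eval_condition`, `gte`, `state`) are hypothetical.

```python
def make_safe_env(state):
    """Build an eval environment with no builtins and only vetted helpers.

    Hypothetical sketch: names do not match EVA-Bench's internals.
    """
    return {
        "__builtins__": {},                    # block access to Python built-ins
        "state": lambda key: state.get(key),   # safe state-lookup helper
        "gte": lambda a, b: a >= b,            # safe comparison helper
    }

def eval_condition(expr, state):
    # Conditions are maintainer-authored boolean expressions, e.g.
    # "gte(state('battery'), 20) and state('mode') == 'nominal'"
    return bool(eval(expr, make_safe_env(state)))
```

Clearing `__builtins__` blocks the obvious escapes (`__import__`, `open`, `exec`), but it is not a true sandbox, which is why the trust model above matters.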
If you extend EVA-Bench with user-supplied or model-generated condition strings, replace `eval()` with a safe expression evaluator such as `simpleeval` or an AST-based parser.
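As one possible shape for the AST-based alternative, the sketch below walks Python's `ast` module and accepts only boolean operators, comparisons, names, and constants, rejecting everything else (calls, attribute access, subscripts). It is a minimal illustration, not a drop-in replacement.

```python
import ast
import operator

# Comparison operators the evaluator is willing to execute.
_OPS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
}

def safe_eval(expr, variables):
    """Evaluate a boolean expression over `variables`, allowing only
    and/or/not, comparisons, names, and literal constants."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BoolOp):
            vals = [walk(v) for v in node.values]
            return all(vals) if isinstance(node.op, ast.And) else any(vals)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not walk(node.operand)
        if isinstance(node, ast.Compare):
            left = walk(node.left)
            for op, comparator in zip(node.ops, node.comparators):
                right = walk(comparator)
                if not _OPS[type(op)](left, right):
                    return False
                left = right   # support chained comparisons like 0 < x < 10
            return True
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError(f"disallowed expression node: {type(node).__name__}")
    return walk(ast.parse(expr, mode="eval"))
```

Because function calls are not in the allow-list, an input like `__import__('os')` raises `ValueError` instead of executing.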
API keys for model providers (OpenAI, Anthropic, Google) are loaded from environment variables via `.env` files. The `.env` file is excluded from version control via `.gitignore`. The included `.env.example` contains only placeholder values.
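A minimal sketch of the lookup side of this pattern, assuming a `.env` file has already been loaded into the process environment (e.g. via `python-dotenv`); the variable name used here is an assumption, so check `.env.example` for the actual names:

```python
import os

def get_api_key(var_name):
    """Read an API key from the environment, failing loudly if unset.

    `var_name` (e.g. "OPENAI_API_KEY") is assumed, not taken from
    EVA-Bench's code; see .env.example for the real variable names.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set; copy .env.example to .env")
    return key
```

Failing at startup with a clear message is preferable to letting a missing key surface later as an opaque provider error.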
Task JSON files, traces, and scoring results contain no personally identifiable information. The NASA corpus consists of publicly available documents from the NASA Technical Reports Server (NTRS).