feat: PROMPTFOO_REDTEAM_GUARDRAILS_ONLY #6426
base: main
👍 All Clear
I reviewed this PR for LLM security vulnerabilities, focusing on the new guardrails-only mode feature. The changes add a configuration flag that bypasses LLM-based grading and instead checks only whether responses were blocked by guardrails. No LLM security vulnerabilities were identified.
Minimum severity threshold for this scan: 🟡 Medium | Learn more
📝 Walkthrough
This pull request introduces a guardrails-only mode for red team grading via a new environment variable, `PROMPTFOO_REDTEAM_GUARDRAILS_ONLY`. The changes add this flag to the environment variable definitions, implement the guardrails-only evaluation logic in the red team grader base class (which bypasses LLM-based rubric evaluation and determines pass/fail based on whether a response was blocked), and propagate `providerResponse` through `grader.getResult` calls across seven different red team provider implementations. The feature includes type definitions for the new context field and documentation of the new environment variable.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
site/docs/usage/command-line.md (1)
728-728: Consider rephrasing to avoid adverb repetition.
The description uses "only" twice in close proximity. Consider rephrasing for better readability:

    -| `PROMPTFOO_REDTEAM_GUARDRAILS_ONLY` | Short-circuits red team grading to only test guardrails. Skips LLM-based grading and checks if response was blocked: PASS = blocked (guardrails.flagged, isRefusal, or text refusal), FAIL = not blocked. Useful for fast guardrail validation. | `false` |
    +| `PROMPTFOO_REDTEAM_GUARDRAILS_ONLY` | Short-circuits red team grading to test guardrails exclusively. Skips LLM-based grading and checks if response was blocked: PASS = blocked (guardrails.flagged, isRefusal, or text refusal), FAIL = not blocked. Useful for fast guardrail validation. | `false` |

src/assertions/redteam.ts (1)
44-53: The undefined parameters explicitly skip the `additionalRubric` and `skipRefusalCheck` optional parameters to pass `gradingContext`.
The two `undefined` values at lines 50-51 correspond to the optional parameters `additionalRubric?: string` and `skipRefusalCheck?: boolean` in the `getResult` method signature. While this pattern is functional and used elsewhere in the codebase, it reduces readability by obscuring the intent to skip intermediate optional parameters.
Consider extracting the context object to make the skipped parameters explicit:

    const gradingContext = { providerResponse };
    const { grade, rubric, suggestions } = await grader.getResult(
      prompt,
      outputString,
      test,
      provider,
      renderedValue,
      undefined,
      undefined,
      gradingContext,
    );

📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- site/docs/usage/command-line.md (1 hunks)
- src/assertions/redteam.ts (1 hunks)
- src/envars.ts (1 hunks)
- src/redteam/plugins/base.ts (4 hunks)
- src/redteam/providers/crescendo/index.ts (1 hunks)
- src/redteam/providers/custom/index.ts (1 hunks)
- src/redteam/providers/goat.ts (1 hunks)
- src/redteam/providers/hydra/index.ts (1 hunks)
- src/redteam/providers/iterative.ts (1 hunks)
- src/redteam/providers/iterativeMeta.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use TypeScript with strict type checking
Files:
`src/envars.ts`, `src/redteam/providers/custom/index.ts`, `src/redteam/providers/crescendo/index.ts`, `src/redteam/providers/goat.ts`, `src/assertions/redteam.ts`, `src/redteam/plugins/base.ts`, `src/redteam/providers/hydra/index.ts`, `src/redteam/providers/iterative.ts`, `src/redteam/providers/iterativeMeta.ts`
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{ts,tsx,js,jsx}: Follow consistent import order - Biome will handle import sorting
Use consistent curly braces for all control statements
Prefer const over let; avoid var
Use object shorthand syntax whenever possible
Use async/await for asynchronous code
Use consistent error handling with proper type checks
Always sanitize sensitive data before logging to prevent exposing secrets, API keys, passwords, and other credentials in logs
Use the logger methods (`debug`, `info`, `warn`, `error`) with an optional second parameter for context objects that will be automatically sanitized
Use `sanitizeObject` from `./util/sanitizer` for manual sanitization of data before using it in non-logging contexts
Files:
`src/envars.ts`, `src/redteam/providers/custom/index.ts`, `src/redteam/providers/crescendo/index.ts`, `src/redteam/providers/goat.ts`, `src/assertions/redteam.ts`, `src/redteam/plugins/base.ts`, `src/redteam/providers/hydra/index.ts`, `src/redteam/providers/iterative.ts`, `src/redteam/providers/iterativeMeta.ts`
site/docs/**/*.md
📄 CodeRabbit inference engine (site/docs/AGENTS.md)
site/docs/**/*.md: Do not modify existing documentation headings as they are often externally linked
Use 'eval' instead of 'evaluation' in all documentation and code references
Use 'Promptfoo' (capitalized) at the start of sentences and headings, and 'promptfoo' (lowercase) in code, commands, and package names
Every documentation page must include front matter with title (under 60 characters), description (150-160 characters), and sidebar_position fields
Only add titles to complete, runnable code blocks; do not add titles to code fragments
Use comment directives (highlight-next-line, highlight-start/highlight-end) to highlight important lines in code blocks
Never remove existing highlight directives when editing code blocks
Use admonition blocks (:::note, :::warning, :::danger) for important information and always include empty lines around content inside admonitions
Write documentation with clear, concise language using active voice, spell out acronyms on first use, and write for an international audience avoiding idioms
Files:
site/docs/usage/command-line.md
src/redteam/**/*.ts
📄 CodeRabbit inference engine (src/redteam/AGENTS.md)
src/redteam/**/*.ts: Always sanitize when logging red team test content; the second parameter to logger functions is auto-sanitized for harmful/sensitive content
Assign risk severity levels to red team test results: critical for PII leaks and SQL injection, high for jailbreaks/prompt injection/harmful content, medium for bias/hallucination, low for overreliance
Files:
`src/redteam/providers/custom/index.ts`, `src/redteam/providers/crescendo/index.ts`, `src/redteam/providers/goat.ts`, `src/redteam/plugins/base.ts`, `src/redteam/providers/hydra/index.ts`, `src/redteam/providers/iterative.ts`, `src/redteam/providers/iterativeMeta.ts`
src/redteam/plugins/*.ts
📄 CodeRabbit inference engine (src/redteam/AGENTS.md)
src/redteam/plugins/*.ts: Implement the `RedteamPluginObject` interface when adding new plugins to the red team testing framework
Generate targeted test cases for specific vulnerabilities in red team plugins
Include assertions defining failure conditions in red team plugin test cases
Reference `src/redteam/plugins/pii.ts` as the pattern for implementing new red team plugins
Files:
src/redteam/plugins/base.ts
🧠 Learnings (19)
📓 Common learnings
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-30T04:53:04.572Z
Learning: Pull request titles must follow Conventional Commits format with types (feat/fix/chore/refactor/docs/test/ci/revert/perf) and scopes, with `redteam` being mandatory for all redteam-related changes
📚 Learning: 2025-11-29T00:26:38.602Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: test/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:38.602Z
Learning: Applies to test/**/*.test.ts : Use Jest for core promptfoo functionality testing (not Vitest), configured via jest.config.ts, importing from jest/globals
Applied to files:
src/envars.ts
📚 Learning: 2025-11-29T00:26:16.682Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new red team plugins in the `test/redteam/` directory
Applied to files:
`src/envars.ts`, `src/assertions/redteam.ts`, `src/redteam/plugins/base.ts`
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : For trivial test cases in configuration, make them quirky and fun to increase engagement
Applied to files:
`src/envars.ts`, `site/docs/usage/command-line.md`
📚 Learning: 2025-07-18T17:25:38.445Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/README.md : For each environment variable, explain its purpose, how to obtain it, and any default values or constraints in the README
Applied to files:
site/docs/usage/command-line.md
📚 Learning: 2025-11-29T00:25:12.625Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: examples/AGENTS.md:0-0
Timestamp: 2025-11-29T00:25:12.625Z
Learning: Applies to examples/*/README.md : Document required environment variables in README with format: list variable name and description
Applied to files:
site/docs/usage/command-line.md
📚 Learning: 2025-11-29T00:25:12.625Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: examples/AGENTS.md:0-0
Timestamp: 2025-11-29T00:25:12.625Z
Learning: Applies to examples/**/promptfooconfig.yaml : In promptfooconfig.yaml, maintain strict field order: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests
Applied to files:
site/docs/usage/command-line.md
📚 Learning: 2025-07-18T17:25:38.444Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Follow the specific field order in all configuration files: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests
Applied to files:
site/docs/usage/command-line.md
📚 Learning: 2025-11-29T00:26:16.682Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/graders.ts : Evaluate attack success using grader logic in `src/redteam/graders.ts`
Applied to files:
`src/redteam/providers/custom/index.ts`, `src/redteam/providers/crescendo/index.ts`, `src/redteam/providers/goat.ts`, `src/assertions/redteam.ts`, `src/redteam/plugins/base.ts`, `src/redteam/providers/hydra/index.ts`, `src/redteam/providers/iterative.ts`, `src/redteam/providers/iterativeMeta.ts`
📚 Learning: 2025-11-29T00:24:17.012Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-11-29T00:24:17.012Z
Learning: Applies to src/redteam/**/*agent*.{ts,tsx,js,jsx} : Maintain clear agent interface definitions and usage patterns
Applied to files:
`src/assertions/redteam.ts`, `src/redteam/providers/iterativeMeta.ts`
📚 Learning: 2025-11-29T00:26:16.682Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/plugins/*.ts : Include assertions defining failure conditions in red team plugin test cases
Applied to files:
`src/assertions/redteam.ts`, `src/redteam/plugins/base.ts`
📚 Learning: 2025-11-29T00:26:16.682Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/**/*.ts : Assign risk severity levels to red team test results: critical for PII leaks and SQL injection, high for jailbreaks/prompt injection/harmful content, medium for bias/hallucination, low for overreliance
Applied to files:
`src/assertions/redteam.ts`, `src/redteam/plugins/base.ts`
📚 Learning: 2025-11-29T00:26:16.682Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/plugins/*.ts : Implement `RedteamPluginObject` interface when adding new plugins to the red team testing framework
Applied to files:
`src/assertions/redteam.ts`, `src/redteam/plugins/base.ts`
📚 Learning: 2025-11-29T00:26:16.682Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/plugins/*.ts : Generate targeted test cases for specific vulnerabilities in red team plugins
Applied to files:
`src/assertions/redteam.ts`, `src/redteam/plugins/base.ts`
📚 Learning: 2025-11-29T00:26:06.422Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/providers/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:06.422Z
Learning: Applies to src/providers/**/*.ts : Provider implementations must transform promptfoo prompts to provider-specific API format and return normalized `ProviderResponse` for evaluation
Applied to files:
src/assertions/redteam.ts
📚 Learning: 2025-11-29T00:26:16.682Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/plugins/*.ts : Reference `src/redteam/plugins/pii.ts` as the pattern for implementing new red team plugins
Applied to files:
src/redteam/plugins/base.ts
📚 Learning: 2025-11-29T00:26:38.602Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: test/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:38.602Z
Learning: Applies to test/**/test/providers/*.test.ts : Provider test files must cover: success cases (normal API response), error cases (4xx, 5xx, rate limits), configuration validation, and token usage tracking
Applied to files:
src/redteam/plugins/base.ts
📚 Learning: 2025-11-29T00:26:06.422Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/providers/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:06.422Z
Learning: Applies to src/providers/test/providers/**/*.{ts,tsx,js} : Every provider implementation must have comprehensive tests in `test/providers/` directory including success cases, error cases, rate limits, timeouts, and invalid config scenarios
Applied to files:
src/redteam/plugins/base.ts
📚 Learning: 2025-11-29T00:26:06.422Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/providers/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:06.422Z
Learning: Applies to src/providers/test/providers/**/*.{ts,tsx,js} : Provider tests must mock API responses and not call real APIs
Applied to files:
src/redteam/plugins/base.ts
🧬 Code graph analysis (1)
src/redteam/plugins/base.ts (2)
src/envars.ts (1)
`getEnvBool` (424-433)
src/redteam/util.ts (2)
`isEmptyResponse` (175-183), `isBasicRefusal` (185-191)
🪛 LanguageTool
site/docs/usage/command-line.md
[style] ~728-~728: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... | Short-circuits red team grading to only test guardrails. Skips LLM-based gradin...
(ADVERB_REPETITION_PREMIUM)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: Redteam (Staging API)
- GitHub Check: security-scan
- GitHub Check: Test on Node 24.x and windows-latest
- GitHub Check: Test on Node 22.x and windows-latest
- GitHub Check: webui tests
- GitHub Check: Test on Node 20.x and ubuntu-latest
- GitHub Check: Redteam (Production API)
- GitHub Check: Test on Node 22.x and macOS-latest
- GitHub Check: Share Test
- GitHub Check: Test on Node 20.x and macOS-latest
- GitHub Check: Test on Node 20.x and windows-latest
- GitHub Check: Test on Node 22.x and ubuntu-latest
- GitHub Check: Build Docs
- GitHub Check: Test on Node 24.x and ubuntu-latest
- GitHub Check: Build on Node 22.x
- GitHub Check: Build on Node 20.x
- GitHub Check: Build on Node 24.x
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (11)
src/redteam/providers/custom/index.ts (1)
490-502: LGTM! The grader invocation correctly passes the provider response context for guardrails-only mode checking. The pattern is consistent with other providers in this PR.
src/redteam/providers/iterativeMeta.ts (1)
290-299: LGTM! The provider response is correctly propagated to the grader for guardrails evaluation.
src/redteam/providers/hydra/index.ts (1)
443-452: LGTM! The provider response context is correctly passed to the grader, enabling guardrails-only evaluation mode.
src/redteam/providers/crescendo/index.ts (1)
463-480: LGTM! The grader invocation correctly combines provider response with optional tracing context. The conditional inclusion of tracing data based on `tracingOptions.includeInGrading` is well-structured.
src/redteam/providers/goat.ts (1)
417-434: LGTM! The grader call correctly passes provider response and conditionally includes tracing data for grading. The pattern matches the Crescendo provider implementation.
src/redteam/providers/iterative.ts (1)
374-391: LGTM! The provider response and optional tracing data are correctly passed to the grader, following the established pattern.
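The conditional-tracing pattern described in the Crescendo, GOAT, and iterative comments can be sketched roughly as follows. This is an illustration only, not the PR's code: `buildGradingContext`, `TracingOptions`, and the `traceSummary` field are hypothetical names, and only `providerResponse` and `tracingOptions.includeInGrading` come from the review text.

```typescript
// Hypothetical shapes for illustration; the real promptfoo types differ.
interface TracingOptions {
  includeInGrading?: boolean;
}

interface GradingContext<R> {
  providerResponse: R;
  traceSummary?: string; // assumed field name, not taken from the PR
}

// Always pass the provider response; attach tracing data only when
// the caller opted in and a trace summary is actually available.
function buildGradingContext<R>(
  providerResponse: R,
  tracingOptions: TracingOptions,
  traceSummary?: string,
): GradingContext<R> {
  return {
    providerResponse,
    ...(tracingOptions.includeInGrading && traceSummary !== undefined ? { traceSummary } : {}),
  };
}
```

Using a conditional spread keeps the `traceSummary` key absent, rather than present with an `undefined` value, when tracing is not included in grading.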
src/redteam/plugins/base.ts (4)
3-3: LGTM! The import of `getEnvBool` is correctly added to support the new guardrails-only mode.
14-14: LGTM! Exporting the `ProviderResponse` type makes it available for the extended `RedteamGradingContext` interface.
355-360: LGTM! The addition of `providerResponse` to `RedteamGradingContext` enables guardrails-only mode to access the full provider response for blocking detection.
428-469: Well-implemented guardrails-only mode. The implementation correctly:
- Uses environment flag to enable the mode
- Safely extracts guardrails signals with fallback defaults
- Combines multiple blocking indicators (guardrails.flagged, isRefusal, text-based refusal/empty)
- Builds detailed reasons listing all triggers
- Returns appropriate pass/fail based on whether the request was blocked
The logic properly handles undefined values and avoids duplicate trigger reporting (line 448).
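As a rough sketch of the behavior summarized above (not the actual code in `src/redteam/plugins/base.ts`), the blocked/not-blocked decision could look like the following. The interfaces, `gradeGuardrailsOnly`, `looksLikeRefusal`, and the refusal phrases are all assumptions made for illustration; only the three signals (`guardrails.flagged`, `isRefusal`, text refusal or empty output) and the PASS = blocked mapping come from the review.

```typescript
// Hypothetical, simplified shapes; the real ProviderResponse has more fields.
interface ResponseLike {
  isRefusal?: boolean;
  guardrails?: { flagged?: boolean };
}

interface GuardrailsOnlyGrade {
  pass: boolean;
  score: number;
  reason: string;
}

// Stand-in for a text-based refusal check; the phrases are illustrative only.
function looksLikeRefusal(text: string): boolean {
  const lowered = text.trim().toLowerCase();
  return ['i cannot', "i can't", 'i am unable'].some((prefix) => lowered.indexOf(prefix) === 0);
}

function gradeGuardrailsOnly(
  response: ResponseLike | undefined,
  output: string,
): GuardrailsOnlyGrade {
  const triggers: string[] = [];
  // Optional chaining gives safe defaults when the response or its fields are missing.
  if (response?.guardrails?.flagged === true) {
    triggers.push('guardrails.flagged');
  }
  if (response?.isRefusal === true) {
    triggers.push('isRefusal');
  } else if (output.trim() === '' || looksLikeRefusal(output)) {
    // Only report the text-based signal when isRefusal did not already fire,
    // so the same refusal is not listed twice.
    triggers.push('text refusal or empty output');
  }
  const blocked = triggers.length > 0;
  return {
    pass: blocked, // PASS means the guardrails blocked the response
    score: blocked ? 1 : 0,
    reason: blocked ? `Blocked by: ${triggers.join(', ')}` : 'Response was not blocked',
  };
}
```

Note the inverted intuition compared with ordinary grading: in this mode a refusal or flagged response is the passing outcome, because the thing under test is the guardrail rather than the model's answer.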
src/envars.ts (1)
42-48: Feature flag is properly implemented and well-documented. The environment variable declaration and implementation in `src/redteam/plugins/base.ts` correctly match the documented behavior. When enabled, the flag skips LLM-based grading and only checks if the response was blocked through guardrails, provider refusal, or text-based signals. The grading result properly maps `pass: true` when blocked (guardrails worked) and `pass: false` when not blocked (guardrails didn't trigger). Logging appropriately uses `logger.debug` with context.
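For readers unfamiliar with boolean environment flags, a minimal sketch of how such a flag might be read is shown here. `readBoolFlag` is a hypothetical helper, and the set of accepted truthy strings is an assumption; promptfoo's own `getEnvBool` may parse values differently. The environment is passed in as a plain record so the example does not depend on Node typings.

```typescript
// Hypothetical helper, not promptfoo's getEnvBool.
function readBoolFlag(
  env: Record<string, string | undefined>,
  name: string,
  defaultValue = false,
): boolean {
  const raw = env[name];
  if (raw === undefined) {
    return defaultValue;
  }
  const normalized = raw.trim().toLowerCase();
  // Assumed truthy spellings; anything else is treated as false.
  return ['1', 'true', 'yes', 'on'].some((value) => value === normalized);
}

// The flag defaults to false, matching the documented default in the table row.
const guardrailsOnly = readBoolFlag(
  { PROMPTFOO_REDTEAM_GUARDRAILS_ONLY: 'true' },
  'PROMPTFOO_REDTEAM_GUARDRAILS_ONLY',
);
```

In a real Node process the first argument would be `process.env`.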
mode for skipping all the red team graders