feat: PROMPTFOO_REDTEAM_GUARDRAILS_ONLY #6426

typpo · 2025-12-01T16:10:37Z

mode for skipping all the red team graders

promptfoo-scanner

👍 All Clear

I reviewed this PR for LLM security vulnerabilities, focusing on the new guardrails-only mode feature. The changes add a configuration flag that bypasses LLM-based grading and instead checks only whether responses were blocked by guardrails. No LLM security vulnerabilities were identified.

_{Minimum severity threshold for this scan: 🟡 Medium | Learn more}

coderabbitai · 2025-12-01T16:14:05Z

📝 Walkthrough

Walkthrough

This pull request introduces a guardrails-only mode for red team grading via a new environment variable PROMPTFOO_REDTEAM_GUARDRAILS_ONLY. The changes add this flag to the environment variable definitions, implement the guardrails-only evaluation logic in the red team grader base class (which bypasses LLM-based rubric evaluation and determines pass/fail based on whether a response was blocked), and propagate providerResponse through grader.getResult calls across seven different red team provider implementations. The feature includes type definitions for the new context field and documentation of the new environment variable.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

src/redteam/plugins/base.ts: Contains the core guardrails-only grading logic with flagged response detection and blocking determination—requires verification that the refusal detection and rubric construction logic is sound
Provider files consistency: Seven provider files (crescendo, custom, goat, hydra, iterative, iterativeMeta, and assertions/redteam.ts) are updated with similar providerResponse propagation patterns—review should verify all call sites follow the same structure and that tracing context is conditionally included where appropriate
Type and signature changes: Verify that providerResponse field addition to RedteamGradingContext and ProviderResponse type export do not break existing code paths or create type conflicts
Environment variable integration: Confirm getEnvBool utility is correctly used to read the new PROMPTFOO_REDTEAM_GUARDRAILS_ONLY flag and that it gates the guardrails-only execution path properly

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main feature being added: a new environment variable (PROMPTFOO_REDTEAM_GUARDRAILS_ONLY) for guardrails-only red team grading mode.
Description check	✅ Passed	The description is directly related to the changeset, describing the purpose of skipping red team graders when guardrails-only mode is enabled.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch ian/20251201-081032

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

site/docs/usage/command-line.md (1)
728-728: Consider rephrasing to avoid adverb repetition.

The description uses "only" twice in close proximity. Consider rephrasing for better readability:
-| `PROMPTFOO_REDTEAM_GUARDRAILS_ONLY`           | Short-circuits red team grading to only test guardrails. Skips LLM-based grading and checks if response was blocked: PASS = blocked (guardrails.flagged, isRefusal, or text refusal), FAIL = not blocked. Useful for fast guardrail validation.                                   | `false`                       |
+| `PROMPTFOO_REDTEAM_GUARDRAILS_ONLY`           | Short-circuits red team grading to test guardrails exclusively. Skips LLM-based grading and checks if response was blocked: PASS = blocked (guardrails.flagged, isRefusal, or text refusal), FAIL = not blocked. Useful for fast guardrail validation.                                   | `false`                       |
src/assertions/redteam.ts (1)
44-53: The undefined parameters explicitly skip additionalRubric and skipRefusalCheck optional parameters to pass gradingContext.

The two undefined values at lines 50-51 correspond to the optional parameters additionalRubric?: string and skipRefusalCheck?: boolean in the getResult method signature. While this pattern is functional and used elsewhere in the codebase, it reduces readability by obscuring the intent to skip intermediate optional parameters.

Consider extracting the context object to make the skipped parameters explicit:
const gradingContext = { providerResponse };
const { grade, rubric, suggestions } = await grader.getResult(
  prompt,
  outputString,
  test,
  provider,
  renderedValue,
  undefined,
  undefined,
  gradingContext,
);

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 137231a and de74c22.

📒 Files selected for processing (10)

site/docs/usage/command-line.md (1 hunks)
src/assertions/redteam.ts (1 hunks)
src/envars.ts (1 hunks)
src/redteam/plugins/base.ts (4 hunks)
src/redteam/providers/crescendo/index.ts (1 hunks)
src/redteam/providers/custom/index.ts (1 hunks)
src/redteam/providers/goat.ts (1 hunks)
src/redteam/providers/hydra/index.ts (1 hunks)
src/redteam/providers/iterative.ts (1 hunks)
src/redteam/providers/iterativeMeta.ts (1 hunks)

🧰 Additional context used

📓 Path-based instructions (5)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use TypeScript with strict type checking

Files:

src/envars.ts
src/redteam/providers/custom/index.ts
src/redteam/providers/crescendo/index.ts
src/redteam/providers/goat.ts
src/assertions/redteam.ts
src/redteam/plugins/base.ts
src/redteam/providers/hydra/index.ts
src/redteam/providers/iterative.ts
src/redteam/providers/iterativeMeta.ts

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.{ts,tsx,js,jsx}: Follow consistent import order - Biome will handle import sorting
Use consistent curly braces for all control statements
Prefer const over let; avoid var
Use object shorthand syntax whenever possible
Use async/await for asynchronous code
Use consistent error handling with proper type checks
Always sanitize sensitive data before logging to prevent exposing secrets, API keys, passwords, and other credentials in logs
Use the logger methods (debug, info, warn, error) with an optional second parameter for context objects that will be automatically sanitized
Use sanitizeObject from ./util/sanitizer for manual sanitization of data before using it in non-logging contexts

Files:

src/envars.ts
src/redteam/providers/custom/index.ts
src/redteam/providers/crescendo/index.ts
src/redteam/providers/goat.ts
src/assertions/redteam.ts
src/redteam/plugins/base.ts
src/redteam/providers/hydra/index.ts
src/redteam/providers/iterative.ts
src/redteam/providers/iterativeMeta.ts

site/docs/**/*.md

📄 CodeRabbit inference engine (site/docs/AGENTS.md)

site/docs/**/*.md: Do not modify existing documentation headings as they are often externally linked
Use 'eval' instead of 'evaluation' in all documentation and code references
Use 'Promptfoo' (capitalized) at the start of sentences and headings, and 'promptfoo' (lowercase) in code, commands, and package names
Every documentation page must include front matter with title (under 60 characters), description (150-160 characters), and sidebar_position fields
Only add titles to complete, runnable code blocks; do not add titles to code fragments
Use comment directives (highlight-next-line, highlight-start/highlight-end) to highlight important lines in code blocks
Never remove existing highlight directives when editing code blocks
Use admonition blocks (:::note, :::warning, :::danger) for important information and always include empty lines around content inside admonitions
Write documentation with clear, concise language using active voice, spell out acronyms on first use, and write for an international audience avoiding idioms

Files:

site/docs/usage/command-line.md

src/redteam/**/*.ts

📄 CodeRabbit inference engine (src/redteam/AGENTS.md)

src/redteam/**/*.ts: Always sanitize when logging red team test content; the second parameter to logger functions is auto-sanitized for harmful/sensitive content
Assign risk severity levels to red team test results: critical for PII leaks and SQL injection, high for jailbreaks/prompt injection/harmful content, medium for bias/hallucination, low for overreliance

Files:

src/redteam/providers/custom/index.ts
src/redteam/providers/crescendo/index.ts
src/redteam/providers/goat.ts
src/redteam/plugins/base.ts
src/redteam/providers/hydra/index.ts
src/redteam/providers/iterative.ts
src/redteam/providers/iterativeMeta.ts

src/redteam/plugins/*.ts

📄 CodeRabbit inference engine (src/redteam/AGENTS.md)

src/redteam/plugins/*.ts: Implement RedteamPluginObject interface when adding new plugins to the red team testing framework
Generate targeted test cases for specific vulnerabilities in red team plugins
Include assertions defining failure conditions in red team plugin test cases
Reference src/redteam/plugins/pii.ts as the pattern for implementing new red team plugins

Files:

src/redteam/plugins/base.ts

🧠 Learnings (19)

📓 Common learnings

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-30T04:53:04.572Z
Learning: Pull request titles must follow Conventional Commits format with types (feat/fix/chore/refactor/docs/test/ci/revert/perf) and scopes, with `redteam` being mandatory for all redteam-related changes

📚 Learning: 2025-11-29T00:26:38.602Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: test/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:38.602Z
Learning: Applies to test/**/*.test.ts : Use Jest for core promptfoo functionality testing (not Vitest), configured via jest.config.ts, importing from jest/globals

Applied to files:

src/envars.ts

📚 Learning: 2025-11-29T00:26:16.682Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new red team plugins in the `test/redteam/` directory

Applied to files:

src/envars.ts
src/assertions/redteam.ts
src/redteam/plugins/base.ts

📚 Learning: 2025-07-18T17:25:38.444Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : For trivial test cases in configuration, make them quirky and fun to increase engagement

Applied to files:

src/envars.ts
site/docs/usage/command-line.md

📚 Learning: 2025-07-18T17:25:38.445Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.445Z
Learning: Applies to examples/*/README.md : For each environment variable, explain its purpose, how to obtain it, and any default values or constraints in the README

Applied to files:

site/docs/usage/command-line.md

📚 Learning: 2025-11-29T00:25:12.625Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: examples/AGENTS.md:0-0
Timestamp: 2025-11-29T00:25:12.625Z
Learning: Applies to examples/*/README.md : Document required environment variables in README with format: list variable name and description

Applied to files:

site/docs/usage/command-line.md

📚 Learning: 2025-11-29T00:25:12.625Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: examples/AGENTS.md:0-0
Timestamp: 2025-11-29T00:25:12.625Z
Learning: Applies to examples/**/promptfooconfig.yaml : In promptfooconfig.yaml, maintain strict field order: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests

Applied to files:

site/docs/usage/command-line.md

📚 Learning: 2025-07-18T17:25:38.444Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/examples.mdc:0-0
Timestamp: 2025-07-18T17:25:38.444Z
Learning: Applies to examples/*/promptfooconfig*.yaml : Follow the specific field order in all configuration files: description, env (optional), prompts, providers, defaultTest (optional), scenarios (optional), tests

Applied to files:

site/docs/usage/command-line.md

📚 Learning: 2025-11-29T00:26:16.682Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/graders.ts : Evaluate attack success using grader logic in `src/redteam/graders.ts`

Applied to files:

src/redteam/providers/custom/index.ts
src/redteam/providers/crescendo/index.ts
src/redteam/providers/goat.ts
src/assertions/redteam.ts
src/redteam/plugins/base.ts
src/redteam/providers/hydra/index.ts
src/redteam/providers/iterative.ts
src/redteam/providers/iterativeMeta.ts

📚 Learning: 2025-11-29T00:24:17.012Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-11-29T00:24:17.012Z
Learning: Applies to src/redteam/**/*agent*.{ts,tsx,js,jsx} : Maintain clear agent interface definitions and usage patterns

Applied to files:

src/assertions/redteam.ts
src/redteam/providers/iterativeMeta.ts

📚 Learning: 2025-11-29T00:26:16.682Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/plugins/*.ts : Include assertions defining failure conditions in red team plugin test cases

Applied to files:

src/assertions/redteam.ts
src/redteam/plugins/base.ts

📚 Learning: 2025-11-29T00:26:16.682Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/**/*.ts : Assign risk severity levels to red team test results: critical for PII leaks and SQL injection, high for jailbreaks/prompt injection/harmful content, medium for bias/hallucination, low for overreliance

Applied to files:

src/assertions/redteam.ts
src/redteam/plugins/base.ts

📚 Learning: 2025-11-29T00:26:16.682Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/plugins/*.ts : Implement `RedteamPluginObject` interface when adding new plugins to the red team testing framework

Applied to files:

src/assertions/redteam.ts
src/redteam/plugins/base.ts

📚 Learning: 2025-11-29T00:26:16.682Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/plugins/*.ts : Generate targeted test cases for specific vulnerabilities in red team plugins

Applied to files:

src/assertions/redteam.ts
src/redteam/plugins/base.ts

📚 Learning: 2025-11-29T00:26:06.422Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/providers/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:06.422Z
Learning: Applies to src/providers/**/*.ts : Provider implementations must transform promptfoo prompts to provider-specific API format and return normalized `ProviderResponse` for evaluation

Applied to files:

src/assertions/redteam.ts

📚 Learning: 2025-11-29T00:26:16.682Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:16.682Z
Learning: Applies to src/redteam/plugins/*.ts : Reference `src/redteam/plugins/pii.ts` as the pattern for implementing new red team plugins

Applied to files:

src/redteam/plugins/base.ts

📚 Learning: 2025-11-29T00:26:38.602Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: test/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:38.602Z
Learning: Applies to test/**/test/providers/*.test.ts : Provider test files must cover: success cases (normal API response), error cases (4xx, 5xx, rate limits), configuration validation, and token usage tracking

Applied to files:

src/redteam/plugins/base.ts

📚 Learning: 2025-11-29T00:26:06.422Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/providers/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:06.422Z
Learning: Applies to src/providers/test/providers/**/*.{ts,tsx,js} : Every provider implementation must have comprehensive tests in `test/providers/` directory including success cases, error cases, rate limits, timeouts, and invalid config scenarios

Applied to files:

src/redteam/plugins/base.ts

📚 Learning: 2025-11-29T00:26:06.422Z

Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/providers/AGENTS.md:0-0
Timestamp: 2025-11-29T00:26:06.422Z
Learning: Applies to src/providers/test/providers/**/*.{ts,tsx,js} : Provider tests must mock API responses and not call real APIs

Applied to files:

src/redteam/plugins/base.ts

🧬 Code graph analysis (1)

src/redteam/plugins/base.ts (2)

src/envars.ts (1)

getEnvBool (424-433)

src/redteam/util.ts (2)

isEmptyResponse (175-183)

isBasicRefusal (185-191)

🪛 LanguageTool

site/docs/usage/command-line.md

[style] ~728-~728: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... | Short-circuits red team grading to only test guardrails. Skips LLM-based gradin...

(ADVERB_REPETITION_PREMIUM)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)

GitHub Check: Redteam (Staging API)
GitHub Check: security-scan
GitHub Check: Test on Node 24.x and windows-latest
GitHub Check: Test on Node 22.x and windows-latest
GitHub Check: webui tests
GitHub Check: Test on Node 20.x and ubuntu-latest
GitHub Check: Redteam (Production API)
GitHub Check: Test on Node 22.x and macOS-latest
GitHub Check: Share Test
GitHub Check: Test on Node 20.x and macOS-latest
GitHub Check: Test on Node 20.x and windows-latest
GitHub Check: Test on Node 22.x and ubuntu-latest
GitHub Check: Build Docs
GitHub Check: Test on Node 24.x and ubuntu-latest
GitHub Check: Build on Node 22.x
GitHub Check: Build on Node 20.x
GitHub Check: Build on Node 24.x
GitHub Check: Analyze (python)
GitHub Check: Analyze (javascript-typescript)

🔇 Additional comments (11)

src/redteam/providers/custom/index.ts (1)

490-502: LGTM!

The grader invocation correctly passes the provider response context for guardrails-only mode checking. The pattern is consistent with other providers in this PR.

src/redteam/providers/iterativeMeta.ts (1)

290-299: LGTM!

The provider response is correctly propagated to the grader for guardrails evaluation.

src/redteam/providers/hydra/index.ts (1)

443-452: LGTM!

The provider response context is correctly passed to the grader, enabling guardrails-only evaluation mode.

src/redteam/providers/crescendo/index.ts (1)

463-480: LGTM!

The grader invocation correctly combines provider response with optional tracing context. The conditional inclusion of tracing data based on tracingOptions.includeInGrading is well-structured.

src/redteam/providers/goat.ts (1)

417-434: LGTM!

The grader call correctly passes provider response and conditionally includes tracing data for grading. The pattern matches the Crescendo provider implementation.

src/redteam/providers/iterative.ts (1)

374-391: LGTM!

The provider response and optional tracing data are correctly passed to the grader, following the established pattern.

src/redteam/plugins/base.ts (4)

3-3: LGTM!

The import of getEnvBool is correctly added to support the new guardrails-only mode.

14-14: LGTM!

Exporting ProviderResponse type makes it available for the extended RedteamGradingContext interface.

355-360: LGTM!

The addition of providerResponse to RedteamGradingContext enables guardrails-only mode to access the full provider response for blocking detection.

428-469: Well-implemented guardrails-only mode.

The implementation correctly:

Uses environment flag to enable the mode

Safely extracts guardrails signals with fallback defaults

Combines multiple blocking indicators (guardrails.flagged, isRefusal, text-based refusal/empty)

Builds detailed reasons listing all triggers

Returns appropriate pass/fail based on whether the request was blocked

The logic properly handles undefined values and avoids duplicate trigger reporting (line 448).

src/envars.ts (1)

42-48: Feature flag is properly implemented and well-documented.

The environment variable declaration and implementation in src/redteam/plugins/base.ts correctly match the documented behavior. When enabled, the flag skips LLM-based grading and only checks if the response was blocked through guardrails, provider refusal, or text-based signals. The grading result properly maps pass: true when blocked (guardrails worked) and pass: false when not blocked (guardrails didn't trigger). Logging appropriately uses logger.debug with context.

feat: PROMPTFOO_REDTEAM_GUARDRAILS_ONLY

de74c22

typpo requested a review from a team as a code owner December 1, 2025 16:10

typpo requested a review from MrFlounder December 1, 2025 16:10

promptfoo-scanner bot reviewed Dec 1, 2025

View reviewed changes

coderabbitai bot reviewed Dec 1, 2025

View reviewed changes

wip

b5229b7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

feat: PROMPTFOO_REDTEAM_GUARDRAILS_ONLY #6426

feat: PROMPTFOO_REDTEAM_GUARDRAILS_ONLY #6426

Uh oh!

typpo commented Dec 1, 2025

Uh oh!

promptfoo-scanner bot left a comment

Uh oh!

coderabbitai bot commented Dec 1, 2025

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

feat: PROMPTFOO_REDTEAM_GUARDRAILS_ONLY #6426

Are you sure you want to change the base?

feat: PROMPTFOO_REDTEAM_GUARDRAILS_ONLY #6426

Uh oh!

Conversation

typpo commented Dec 1, 2025

Uh oh!

promptfoo-scanner bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot commented Dec 1, 2025

Walkthrough

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants