FEAT Content harm scenario #1174
Conversation
```python
# Hate speech datasets
hate_stories = await create_seed_dataset(
```
I think we should manage a few of these, even if the list is incomplete. So instead of having strings in the notebooks, I'd put these in datasets/seed_prompts/ai_rt, maybe one file per category.
Eventually it might be nice to have a single function call that can load all our YAML seed prompts into the database so folks can use those as examples.
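A minimal sketch of what that single call might look like, assuming the `SeedDataset.from_yaml_file()` loader referenced later in this thread and PyRIT's central memory (the memory method name and import paths are assumptions, not verified against this PR):

```python
import pathlib

from pyrit.memory import CentralMemory
from pyrit.models import SeedDataset  # class name as used in this PR; import path assumed


async def load_all_seed_datasets(root: str = "pyrit/datasets/seed_prompts") -> None:
    """Hypothetical one-call loader: read every .prompt YAML under `root` and
    add its prompts to the database so folks can use them as examples."""
    memory = CentralMemory.get_memory_instance()
    for path in sorted(pathlib.Path(root).rglob("*.prompt")):
        dataset = SeedDataset.from_yaml_file(path)
        # add_seed_prompts_to_memory_async is assumed here, not verified.
        await memory.add_seed_prompts_to_memory_async(prompts=dataset.prompts, added_by="bulk_loader")
```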
I would go even further and say we should provide a truly end-to-end solution here that gives some results even if the customer doesn't bring their own datasets. Of course, the conundrum is that we may not be able to share the exact datasets we're using, but maybe it's something we should actually strive for.
Btw I'll keep fighting against the ai_rt naming for external assets 😆
I agree, but I think it's safe to require an upload, which could even be done as part of initialization. I think the dataset question can be tackled independently of this PR, although for this one it'd be nice to include some sample datasets that we can later add to the db easily.
E.g., workflow for an external user:
- memory.add_dataset(redteam) # not part of this PR
- Run the scenario
I added a .prompt file for each corresponding strategy so users will be able to run the default scenario and have an example in the content_harm_scenario.ipynb/.py! lmk if that doesn't address your concerns!
```python
    Each harm categories has a few different strategies to test different aspects of the harm type.
    """

    ALL = ("all", {"all"})
```
One idea is to only have the meta-categories. I think this may make the most sense: just have hate, fairness, violence, ..., leakage rather than each individual scenario_strategy.
I think the composition makes the code quite a bit more complicated, and I would guess most users will just want to use "all" or a subset of the categories.
In other words, I think it should look like the following (and that's it):

```python
class RapidResponseHarmStrategy(ScenarioStrategy):
    """
    RapidResponseHarmStrategy defines a set of strategies for testing model behavior
    in several different harm categories.

    Each harm category has a few different strategies to test different aspects of the harm type.
    """

    ALL = ("all", {"all"})
    HATE = ("hate", set[str]())
    FAIRNESS = ("fairness", set[str]())
    VIOLENCE = ("violence", set[str]())
    SEXUAL = ("sexual", set[str]())
    HARASSMENT = ("harassment", set[str]())
    MISINFORMATION = ("misinformation", set[str]())
    LEAKAGE = ("leakage", set[str]())
```
Alternatively, if you do want long- and short-running versions (which I also think is legit!), I might split it up like this, where the complex attacks contain the long-running methods. But my gut is that it might just be simpler to have a completely separate scenario class for those:

```python
ALL = ("all", {"all"})
HATE_QUICK = ("hate_quick", {"quick", "hate"})
HATE_EXTENDED = ("hate_extended", {"complex", "hate"})
FAIRNESS_QUICK = ("fairness_quick", {"quick", "fairness"})
...
```

Either way, I'd keep specific techniques out, as well as specific tests/datasets.
Updated to only have the meta-categories! I was considering the basic/extended options, which would selectively run fewer attacks (maybe the baseline and the single-turn attacks?), but was thinking that users would want more information for a given harm to get more of an idea of where to explore next. Frederic has pointed out that the PromptSendingAttack is likely to be unsuccessful, so we want at least one other attack to give some avenue to explore.
I like only varying one dimension on the ScenarioStrategies if we can. I think it makes it less confusing.
How I might tackle this is to have a base class:
ContentHarmScenarioBase
And then subclasses:
ContentHarmScenarioFast
ContentHarmScenarioMed
ContentHarmScenarioComplex
Each of the subclasses then sets the strategies to use (and can include others, e.g. Med can include Fast); see the sketch below.
I'm not set on this approach. We could potentially combine both "harm" and "strategy" into the class, but imo it's easier to use if we can split like this. Although one weirdness is that sometimes we'll split by harm, other times by strategy. I think that's probably less confusing for users, though?
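A rough sketch of that split, purely illustrative (the strategy names are placeholders, and the real classes would derive from PyRIT's Scenario base rather than from object):

```python
class ContentHarmScenarioBase:
    # Subclasses declare which attack strategies they run by default.
    default_strategies: set[str] = set()


class ContentHarmScenarioFast(ContentHarmScenarioBase):
    default_strategies = {"prompt_sending"}


class ContentHarmScenarioMed(ContentHarmScenarioFast):
    # "Med can include Fast": start from the fast set and extend it.
    default_strategies = ContentHarmScenarioFast.default_strategies | {"role_play", "many_shot"}


class ContentHarmScenarioComplex(ContentHarmScenarioMed):
    default_strategies = ContentHarmScenarioMed.default_strategies | {"red_teaming"}
```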
Note, I might not split them in this PR but in a followup
Overall this is good! It'll be really nice to have solid examples here :) My biggest feedback is that I think we should define exactly what we want out of this scenario. Here is what I think it is: "Can I get a vibe of this objective_target in a couple of hours based on how it does on these harm categories?" And if we keep that strategy, we want to do the best we can to answer that question, and the strategies themselves should be baked in as much as possible. Along these lines, I've left some recommendations inline.
fdubut left a comment:
Adding a few comments, mainly on structure and naming. I'll try to run my notebook shortly "as a scenario" to get a better sense of how this all works, and will share more feedback then.
```python
    model_name=""
)

# Define the helper adversarial target
```
Given the nature of the scenario, returning aggregate results on a variety of test cases, I'm wondering if we should give the option to customers to skip all test cases that require an adversarial target if they don't have one available. I think a lot of attacks that would succeed with a true adversarial target will fail with a regular model, skewing the final results.
Will follow up with another PR with a fix for the multi-turn attack that will allow us to use multi-prompt instead of red teaming, eliminating the need for the adversarial target!
```python
# %%
# Load the datasets into memory

violence_civic_data = await create_seed_dataset(
```
In the original notebook, the prompts are sequential (passed using multi-prompt attack). I haven't looked yet at the actual scenario definition but wanted to point that out.
Mentioned offline, but there's an issue with MultiPrompt that basically makes it error out, so for now I'm using the red teaming attack. For this PR, I'm going to keep it as RedTeaming, and when we work through that issue, I can update this scenario. (I like the idea of keeping this simple, and multi-prompt is a simpler multi-turn attack, so I'd prefer to use it.)
```python
*,
objective_target: PromptTarget,
scenario_strategies: Sequence[RapidResponseHarmStrategy | ScenarioCompositeStrategy] | None = None,
adversarial_chat: Optional[PromptChatTarget] = None,
```
Similar to what I mentioned in my comment on the notebook, I'm wondering if we should exclude from the scenario the attacks that require an adversarial chat when none is passed.
I think we can set a default; this is what the foundry scenario does.
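A minimal sketch of such a default, assuming `OpenAIChatTarget` picks up its endpoint and key from environment variables (the helper name is hypothetical):

```python
from typing import Optional

from pyrit.prompt_target import OpenAIChatTarget, PromptChatTarget


def _resolve_adversarial_chat(adversarial_chat: Optional[PromptChatTarget]) -> PromptChatTarget:
    # Mirror the foundry-scenario behavior described above: use the provided
    # target, otherwise fall back to an environment-configured default.
    return adversarial_chat or OpenAIChatTarget()
```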
```python
scenario_strategies (Sequence[RapidResponseHarmStrategy | ScenarioCompositeStrategy] | None):
    The harm strategies or composite strategies to include in this scenario. If None,
    defaults to RapidResponseHarmStrategy.ALL.
```
Will a user be able to compose a multi-turn scenario strategy like Crescendo? Or are we just sticking with single-turn/multi-prompt sending attacks?
Currently, the default behavior is to run PromptSending (the baseline), RolePlaying, RedTeaming, and ManyShot. I'm on the fence about having basic & extended versions (basic would maybe just run PromptSending and RedTeaming vs. extended, which would run them all); my reservation is that I don't know how much value the scenario has when the basic version is run because, as the name suggests, it's pretty basic. wdyt?
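For reference, selecting just a subset of categories might look like the following sketch (constructor details and import paths are assumptions; the class/strategy names come from this thread):

```python
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.scenarios import RapidResponseHarmScenario, RapidResponseHarmStrategy  # paths assumed

# Stand-in objective target, configured via environment variables.
target = OpenAIChatTarget()

# Run only the selected meta-categories instead of the full default set.
scenario = RapidResponseHarmScenario(
    objective_target=target,
    scenario_strategies=[
        RapidResponseHarmStrategy.HATE,
        RapidResponseHarmStrategy.VIOLENCE,
        RapidResponseHarmStrategy.HARASSMENT,
    ],
)
```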
### Example Structure

```python
class MyStrategy(ScenarioStrategy):
    ...
```
I recommend making this an ipynb. Even if it doesn't do anything, we can run this as a cell and make sure the syntax is correct; e.g., if we rename something, it will not run.
### Existing Scenarios

- **EncodingScenario**: Tests encoding attacks (Base64, ROT13, etc.) with seed prompts and decoding templates
We can actually print this using frontend.scenario_registry.list_scenarios. I have some formatting too, but that's still in another branch, and this might be good enough for this purpose.
I think it's worth doing the scenario listing programmatically so we don't have to keep the list up to date; something like the sketch below.
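A sketch based on the call named above; where `frontend` lives and what the method returns are assumptions:

```python
# List registered scenarios programmatically rather than hand-maintaining
# a markdown list. Module location and return type are assumptions.
from pyrit import frontend  # assumed home of scenario_registry

for scenario_name in frontend.scenario_registry.list_scenarios():
    print(scenario_name)
```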
| "\n", | ||
| "## What is a Scenario?\n", | ||
| "The `FoundryScenario` provides a comprehensive testing approach that includes:\n", | ||
| "- **Converter-based attacks**: Apply various encoding/obfuscation techniques (Base64, Caesar cipher, etc.)\n", |
Scenarios can take a long time to run in our integration tests. We definitely don't want a notebook for every one, but I could see this and content_harms being good examples.
I might organize it like this:
- 0_scenarios.ipynb (updated version of 0_scenarios.md)
- 1_end_to_end_scenario.ipynb (combined content harm scenario and end_to_end scenario)
- 2_composite_scenarios.ipynb (foundry scenario, with a bit extra about composite strategies)
## Sample Datasets

PyRIT provides sample datasets for each harm category in the `ContentHarmStrategy` enum, located in the `pyrit/datasets/seed_prompts/harms/` folder. Each harm category has a corresponding YAML file (e.g., `hate.prompt`, `violence.prompt`, `sexual.prompt`) containing pre-defined prompts and objectives designed to test model behavior for that specific harm type. These datasets can be loaded directly using `SeedDataset.from_yaml_file()` and serve as a starting point for content harm testing, though you can also create custom datasets to suit your specific testing needs.
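For example, loading one of the sample files described above (the class name is taken from this doc; the import path is an assumption):

```python
import pathlib

from pyrit.models import SeedDataset  # import path is an assumption

# Load the shipped hate-speech sample dataset from its documented location.
hate_dataset = SeedDataset.from_yaml_file(
    pathlib.Path("pyrit/datasets/seed_prompts/harms/hate.prompt")
)
print(f"Loaded {len(hate_dataset.prompts)} prompts")
```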
I'd sync with Victor on this workflow. I'm on the fence. His version actually only used the sample datasets. We definitely want the database option as the "default", but I did like how his worked out of the box. Is there a middle ground that is best? E.g., try to get the prompts from memory, but if they don't exist, load the sample datasets? I'm not sure. But either way, I think the two of you should brainstorm and land on one way to do it.
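One possible middle-ground sketch (the memory query API and the dataset naming used here are assumptions, for illustration only):

```python
import pathlib

from pyrit.memory import CentralMemory
from pyrit.models import SeedDataset  # import path is an assumption


def get_harm_prompts(category: str):
    """Prefer prompts already in the database; fall back to the shipped samples."""
    memory = CentralMemory.get_memory_instance()
    # get_seed_prompts(dataset_name=...) is an assumed query; adjust to the real API.
    prompts = memory.get_seed_prompts(dataset_name=f"harms_{category}")
    if prompts:
        return prompts
    sample = pathlib.Path("pyrit/datasets/seed_prompts/harms") / f"{category}.prompt"
    return SeedDataset.from_yaml_file(sample).prompts
```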
The naming schema is **critical** for these scenarios to automatically retrieve the correct datasets. The schema follows this pattern:

```
<seed_dataset_prefix>_<strategy_name>
```
I wonder if we could add a method to ScenarioStrategy that is something like `load_sample_seed_prompts(seed_prompt)`, which links the class to the sample datasets. Not something for this PR, but it would be nice for a follow-up. Then folks don't have to worry about getting the name right, and the errors you've documented can be self-explanatory as part of the class.
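Sketching the suggested helper as a standalone function (the name, signature, and error text are assumptions for a possible follow-up, not this PR's API):

```python
import pathlib

from pyrit.models import SeedDataset  # import path is an assumption


def load_sample_seed_prompts(strategy_name: str, prefix: str = "harms") -> SeedDataset:
    """Hypothetical follow-up helper: derive the sample dataset path from the
    strategy name so callers never hand-build '<seed_dataset_prefix>_<strategy_name>'."""
    path = pathlib.Path("pyrit/datasets/seed_prompts") / prefix / f"{strategy_name}.prompt"
    if not path.exists():
        # Self-explanatory error instead of the naming pitfalls documented above.
        raise FileNotFoundError(f"No sample dataset for strategy '{strategy_name}' at {path}")
    return SeedDataset.from_yaml_file(path)
```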
Would you mind following up, though, if not in this PR? E.g., creating a follow-up item if you think it's a good idea?
```python
# %% [markdown]
#
# We can also selectively choose which strategies to run. In this next example, we'll only run the Hate, Violence, and Harassment strategies.
```
I might only include one code example, maybe with a comment listing other strategies you can use. My only concern is that I don't want integration tests to take too long if we can avoid it without losing clarity.
| "ScenarioIdentifier", | ||
| "ScenarioResult", | ||
| "ContentHarmScenario", | ||
| "ContentHarmStrategy", |
I know, I know, naming is impossible. But a couple of things:
- I suspect we are going to have a ton of scenarios. I like having a prefix to import these, so people would do `from scenarios.e2e import ContentHarmScenario` instead of `from scenarios import ContentHarmScenario`. I want to follow this elsewhere also, e.g. `from scenarios.garak import EncodingScenario`.
- I don't like e2e as a name. All scenarios are, in theory, e2e. I did like airt, but I understand Frederic's point. Some suggestions I also like:
  - redteam_ops
  - redteam
Description
Add a content harm scenario which provides a general set of attacks for each harm category. The idea is to have a quick scenario that runs a comprehensive set of harms before drilling down into more specific ones. The scenario uses the PromptSending (baseline), RolePlay, ManyShot, and RedTeam attacks to provide this summary, using a set of objectives that are either user-defined or provided in the datasets/seed_prompts/harms folder.
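A sketch of the intended default usage (the import path, constructor, and run method are assumptions rather than the PR's exact API):

```python
import asyncio

from pyrit.prompt_target import OpenAIChatTarget
from pyrit.scenarios import ContentHarmScenario  # import path is an assumption


async def main() -> None:
    # Target under test; endpoint/key are read from environment variables.
    target = OpenAIChatTarget()
    scenario = ContentHarmScenario(objective_target=target)
    # Defaults to all harm categories using the PromptSending (baseline),
    # RolePlay, ManyShot, and RedTeam attacks described above.
    result = await scenario.run_async()  # method name is an assumption
    print(result)


asyncio.run(main())
```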
Tests and Documentation
- Added a content harm notebook plus instructions for dataset naming.
- Added unit tests.