
Conversation

hannahwestra25 (Contributor) commented Nov 6, 2025

Description

Adds a content harm scenario that provides a general set of attacks for each harm category. The idea is a quick scenario that runs a comprehensive sweep of harms before drilling down into more specific ones. The scenario uses the PromptSending (baseline), RolePlay, ManyShot, and RedTeam attacks to produce this summary from a set of objectives (user-defined or provided in the datasets/seed_prompts/harms folder).
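
For orientation, a minimal sketch of how the scenario might be invoked (the `pyrit.scenarios` import path and the `run_async` method are assumptions based on how other PyRIT scenarios work, not confirmed details of this PR):

```python
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.scenarios import ContentHarmScenario, ContentHarmStrategy  # import path is an assumption

# The model under test; any PromptTarget should work here.
objective_target = OpenAIChatTarget()

scenario = ContentHarmScenario(
    objective_target=objective_target,
    scenario_strategies=[ContentHarmStrategy.ALL],  # or a subset of harm categories
)
result = await scenario.run_async()  # run in a notebook cell or other async context
```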

Tests and Documentation

Added content harm notebook plus instructions for dataset naming.
Added unit tests.

hannahwestra25 changed the title from "[DRAFT] rapid response harm scenario" to "Rapid response harm scenario" on Nov 7, 2025

# Hate speech datasets

hate_stories = await create_seed_dataset(
rlundeen2 (Contributor), Nov 8, 2025:

I think we should manage a few of these, even if the list is incomplete. So instead of having strings in the notebooks, I'd put these in datasets/seed_prompts/ai_rt and maybe one file per category.

Eventually it might be nice to have a single function call that can load all our YAML seed prompts into the database so folks can use those as examples.

fdubut (Contributor):

I would go even further and say we should provide a truly end-to-end solution here that gives some results even if the customer doesn't bring their own datasets. Of course, the conundrum is that we may not be able to share the exact datasets we're using, but maybe it's something we should actually strive for.

Btw I'll keep fighting against the ai_rt naming for external assets 😆

Contributor:

I agree, but I think it's safe to require an upload, which could even be done as part of initialization. I think the dataset question can be tackled independently of this PR. Although for this one it'd be nice to include some sample datasets that we can later add to the db easily

E.g., the workflow for an external user:

  1. memory.add_dataset(redteam) # not part of this PR
  2. Run the scenario

hannahwestra25 (Author):

I added a .prompt file for each corresponding strategy so users will be able to run the default scenario and have an example in the content_harm_scenario.ipynb/.py! lmk if that doesn't address your concerns!

Each harm category has a few different strategies to test different aspects of the harm type.
"""

ALL = ("all", {"all"})
rlundeen2 (Contributor), Nov 8, 2025:

One idea is to only have the meta-categories. I think this makes the most sense: just have hate, fairness, violence, ..., leakage rather than each individual scenario_strategy.

Contributor:

I think the composition makes the code quite a bit more complicated, and I would guess most users will either just want to use "all" or a subset of the categories.

rlundeen2 (Contributor), Nov 8, 2025:

In other words, I think it should look like the following (and that's it):

class RapidResponseHarmStrategy(ScenarioStrategy):
    """
    RapidResponseHarmStrategy defines a set of strategies for testing model behavior
    in several different harm categories.

    Each harm category has a few different strategies to test different aspects of the harm type.
    """

    ALL = ("all", {"all"})
    HATE = ("hate", set[str]())
    FAIRNESS = ("fairness", set[str]())
    VIOLENCE = ("violence", set[str]())
    SEXUAL = ("sexual", set[str]())
    HARASSMENT = ("harassment", set[str]())
    MISINFORMATION = ("misinformation", set[str]())
    LEAKAGE = ("leakage", set[str]())

Alternatively, if you do want long- and short-running versions (which I also think is legit!), I might split it up like this, where the complex attacks contain the long-running methods. But my gut is that it might just be simpler to have a completely separate scenario class for those:

    ALL = ("all", {"all"})
    HATE_QUICK = ("hate_quick", {"quick", "hate"})
    HATE_EXTENDED = ("hate_extended", {"complex", "hate"})
    FAIRNESS_QUICK = ("fairness_quick", {"quick", "fairness"})
...

Either way, I'd keep specific techniques out of the enum, along with specific tests/datasets.

hannahwestra25 (Author):

Updated to only have the meta-categories! I was considering the basic/extended options, which would selectively run fewer attacks (maybe baseline and the single-turn attacks?), but was thinking that users would want more information for a given harm to get a better idea of where to explore next. Frederic has pointed out that the PromptSendingAttack is likely to be unsuccessful, so we want at least one other attack to give some avenue to explore.

rlundeen2 (Contributor), Nov 14, 2025:

I like only varying one dimension on the ScenarioStrategies if we can. I think it makes it less confusing.

How I might tackle this is to have a base class

ContentHarmScenarioBase

and then subclasses

ContentHarmScenarioFast
ContentHarmScenarioMed
ContentHarmScenarioComplex

Each subclass then sets the strategies to use (and can include others, e.g. Med can include Fast).

I'm not set on this approach. We could potentially combine both "harm" and "strategy" into the class, but imo it's easier to use if we can split like this. One weirdness is that sometimes we'll split by harm, other times by strategy; I think that's still less confusing for users, though?
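
A standalone sketch of this split (class bodies and strategy names are illustrative, not from the PR):

```python
class ContentHarmScenarioBase:
    """Shared setup: datasets, scorers, and per-harm objectives."""

    attack_strategies: list[str] = []


class ContentHarmScenarioFast(ContentHarmScenarioBase):
    # Quick vibe check: baseline prompt sending only.
    attack_strategies = ["prompt_sending"]


class ContentHarmScenarioMed(ContentHarmScenarioBase):
    # Med includes everything Fast runs, plus single-turn jailbreaks.
    attack_strategies = ContentHarmScenarioFast.attack_strategies + ["role_play", "many_shot"]
```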

Contributor:

Note, I might not split them in this PR but in a followup

rlundeen2 (Contributor) commented Nov 8, 2025

Overall this is good! It'll be really nice to have solid examples here :)

My biggest feedback is that I think we should define exactly what we want out of this scenario. Here is what I think it is: "Can I get a vibe of this objective_target in a couple of hours based on how it does on these harm categories?"

And if we keep that strategy, we want to do the best we can to answer that question, and the strategies themselves should be baked in as much as possible. Along these lines, I'd recommend:

  1. Simplify the strategies. I suspect most users just want to run "all" to get a vibe check, or to run specific harm categories. And if there is a strategy they want that takes a long time (like Crescendo), maybe we should split that off into a separate longer-running scenario class.
  2. Choose the attacks for those strategies explicitly (which converters and attacks to use). E.g. we can get the objectives from memory, and then this scenario can decide how we send those. I wouldn't make this configurable, because it adds another dimension to things; a sketch of what that could look like follows below.
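
A sketch of baking the attack lineup in (the attack class names follow the PR description; the exact identifiers and the import path are assumptions about PyRIT's layout):

```python
# Fixed, non-configurable attack lineup applied to each harm category's
# objectives. Import path and class names are assumptions.
from pyrit.executor.attack import (
    ManyShotJailbreakAttack,
    PromptSendingAttack,
    RedTeamingAttack,
    RolePlayAttack,
)

ATTACKS_PER_HARM = [
    PromptSendingAttack,  # baseline
    RolePlayAttack,
    ManyShotJailbreakAttack,
    RedTeamingAttack,
]
```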

romanlutz changed the title from "Rapid response harm scenario" to "FEAT Rapid response harm scenario" on Nov 8, 2025
fdubut (Contributor) left a comment:

Adding a few comments, mainly on structure and naming. I'll try to run my notebook shortly "as a scenario" to get a better sense of how this all works, and will share more feedback then.



model_name=""
)

# Define the helper adversarial target
fdubut (Contributor):

Given the nature of the scenario, returning aggregate results on a variety of test cases, I'm wondering if we should give the option to customers to skip all test cases that require an adversarial target if they don't have one available. I think a lot of attacks that would succeed with a true adversarial target will fail with a regular model, skewing the final results.

hannahwestra25 (Author):

Will follow up with another PR with a fix for the multi-turn attack that will allow us to use multi-prompt instead of red teaming, eliminating the need for the adversarial target!

# %%
# Load the datasets into memory

violence_civic_data = await create_seed_dataset(
fdubut (Contributor):

In the original notebook, the prompts are sequential (passed using multi-prompt attack). I haven't looked yet at the actual scenario definition but wanted to point that out.

hannahwestra25 (Author):

Mentioned offline, but there's an issue with MultiPrompt that basically makes it error out, so for now I'm using the red teaming attack. For this PR I'm going to keep it as RedTeaming, and when we work through that issue I can update this scenario (I like the idea of keeping this simple, and multi-prompt is a simpler multi-turn attack, so I'm partial to using it).

*,
objective_target: PromptTarget,
scenario_strategies: Sequence[RapidResponseHarmStrategy | ScenarioCompositeStrategy] | None = None,
adversarial_chat: Optional[PromptChatTarget] = None,
fdubut (Contributor):

Similar to what I mentioned in my comment on the notebook, I'm wondering if we should exclude from the scenario the attacks that require an adversarial chat when none is passed.

hannahwestra25 (Author):

I think we can set a default; this is what the foundry scenario does.
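
A minimal sketch of that defaulting (the helper name is hypothetical, and using OpenAIChatTarget as the fallback is an assumption about what the default would be):

```python
from typing import Optional

from pyrit.prompt_target import OpenAIChatTarget, PromptChatTarget


def _resolve_adversarial_chat(adversarial_chat: Optional[PromptChatTarget]) -> PromptChatTarget:
    # Fall back to a default adversarial model when the caller passes None;
    # the exact default target is an assumption.
    return adversarial_chat or OpenAIChatTarget()
```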

Comment on lines 111 to 113
scenario_strategies (Sequence[RapidResponseHarmStrategy | ScenarioCompositeStrategy] | None):
The harm strategies or composite strategies to include in this scenario. If None,
defaults to RapidResponseHarmStrategy.ALL.
Contributor:

Will a user be able to compose a multi-turn scenario strategy like Crescendo, or are we sticking with single-turn/multi-prompt sending attacks?

hannahwestra25 (Author):

Currently, the default behavior is to run PromptSending (the baseline), RolePlaying, RedTeaming, and ManyShot. I'm on the fence about having basic & extended versions (basic would maybe just run PromptSending and RedTeaming vs. extended, which would run them all); my reservation is that idk how much value the scenario has when the basic version is run because, as the name suggests, it's pretty basic. wdyt?

hannahwestra25 changed the title from "FEAT Rapid response harm scenario" to "FEAT Content harm scenario" on Nov 13, 2025
### Example Structure

```python
class MyStrategy(ScenarioStrategy):
```
Contributor:

I recommend making this an ipynb. Even if it doesn't do anything, we can run this as a cell and make sure the syntax is correct, e.g. if we rename something it will no longer run.


### Existing Scenarios

- **EncodingScenario**: Tests encoding attacks (Base64, ROT13, etc.) with seed prompts and decoding templates
Contributor:

We can actually print this using frontend.scenario_registry.list_scenarios. I have some formatting also, but that's still in another branch, and this might be good enough for this purpose.

I think it's worth doing the scenario listing programmatically so we don't have to keep the list up to date.
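
If the registry call mentioned above works roughly like this, the list could be generated instead of hand-maintained (the import path and return shape are assumptions; only the call name comes from the comment above):

```python
from pyrit.frontend import scenario_registry  # import path is an assumption

# One line per registered scenario, so the docs never go stale.
for scenario_name in scenario_registry.list_scenarios():
    print(scenario_name)
```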

"\n",
"## What is a Scenario?\n",
"The `FoundryScenario` provides a comprehensive testing approach that includes:\n",
"- **Converter-based attacks**: Apply various encoding/obfuscation techniques (Base64, Caesar cipher, etc.)\n",
Contributor:

Scenarios can take a long time to run in our integration tests, and we definitely don't want a notebook for every one. But I could see this and content_harms being good examples.

I might organize it like this:

  • 0_scenarios.ipynb (updated version of 0_scenarios.md)
  • 1_end_to_end_scenario.ipynb (combined content harm scenario and end_to_end scenario)
  • 2_composite_scenarios.ipynb (foundry scenario, with a bit extra about composite strategies)


## Sample Datasets

PyRIT provides sample datasets for each harm category in the `ContentHarmStrategy` enum, located in the `pyrit/datasets/seed_prompts/harms/` folder. Each harm category has a corresponding YAML file (e.g., `hate.prompt`, `violence.prompt`, `sexual.prompt`) containing pre-defined prompts and objectives designed to test model behavior for that specific harm type. These datasets can be loaded directly using `SeedDataset.from_yaml_file()` and serve as a starting point for content harm testing, though you can also create custom datasets to suit your specific testing needs.
rlundeen2 (Contributor), Nov 14, 2025:

I'd sync with Victor on this workflow; I'm on the fence. His version actually only used the sample datasets. We definitely want the database option as the "default", but I did like how his worked out of the box. Is there a middle ground that is best? E.g. try to get the data from memory, but if that doesn't exist, load the sample datasets? I'm not sure. Either way, I think the two of you should brainstorm and land on one way to do it.
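
A hedged sketch of that middle ground, using the naming schema described below (the memory call reflects current PyRIT's get_seed_prompts as I understand it; the SeedDataset import path and the helper name are assumptions):

```python
from pathlib import Path

from pyrit.memory import CentralMemory
from pyrit.models import SeedDataset  # import path is an assumption


def get_harm_objectives(strategy_name: str, prefix: str = "harms") -> list[str]:
    """Prefer user-uploaded datasets from memory; fall back to bundled samples."""
    memory = CentralMemory.get_memory_instance()
    prompts = memory.get_seed_prompts(dataset_name=f"{prefix}_{strategy_name}")
    if prompts:
        return [p.value for p in prompts]
    # Nothing registered: load the sample dataset shipped with PyRIT.
    sample = Path("pyrit/datasets/seed_prompts/harms") / f"{strategy_name}.prompt"
    return [p.value for p in SeedDataset.from_yaml_file(sample).prompts]
```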

The naming schema is **critical** for these scenarios to automatically retrieve the correct datasets. The schema follows this pattern:

```
<seed_dataset_prefix>_<strategy_name>
```
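
For example, with a seed dataset prefix of `harms` and the hate strategy, the scenario would look for a dataset named `harms_hate` (the prefix value here is illustrative).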
rlundeen2 (Contributor), Nov 14, 2025:

I wonder if we could add a method to the ScenarioStrategy that is something like

load_sample_seed_prompts(seed_prompt)

That links the class to the sample datasets. Not something for this PR, but it would be nice as a follow-up. Then folks don't have to worry about getting the name right, and the errors you've documented can be self-explanatory as part of the class.
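
A rough sketch of that idea (the Enum stand-in, the file layout, and the SeedDataset import path are illustrative assumptions, not this PR's API):

```python
from enum import Enum
from pathlib import Path

from pyrit.models import SeedDataset  # import path is an assumption


class ContentHarmStrategy(Enum):  # simplified stand-in for the real strategy enum
    HATE = "hate"
    VIOLENCE = "violence"

    def load_sample_seed_prompts(self) -> SeedDataset:
        # Derive the sample file from the strategy value so callers never
        # have to construct the dataset name by hand.
        path = Path("pyrit/datasets/seed_prompts/harms") / f"{self.value}.prompt"
        return SeedDataset.from_yaml_file(path)
```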

rlundeen2 (Contributor), Nov 14, 2025:

Would you mind following up, though, if not in this PR? E.g. creating a follow-up item if you think it's a good idea?


# %% [markdown]
#
# We can also selectively choose which strategies to run. In this next example, we'll only run the Hate, Violence, and Harassment strategies.
Contributor:

I might only include one code example, maybe with a comment listing other strategies you can use. My only concern is that I don't want integration tests to take too long if we can avoid it without losing clarity.

"ScenarioIdentifier",
"ScenarioResult",
"ContentHarmScenario",
"ContentHarmStrategy",
rlundeen2 (Contributor), Nov 14, 2025:

I know, I know, naming is impossible. But a couple of things:

  1. I suspect we are going to have a ton of scenarios. I like having a prefix to import these. So people would do
from scenarios.e2e import ContentHarmScenario

instead of

from scenarios import ContentHarmScenario

I want to follow this elsewhere also, e.g. from scenarios.garak import EncodingScenario

  2. I don't like e2e as a name. All scenarios in theory are e2e. I did like airt, but I understand Frederic's point. Some suggestions I also like:

redteam_ops
redteam
