FEAT Content harm scenario #1174
Conversation
```python
# Hate speech datasets
hate_stories = await create_seed_dataset(
```
I think we should manage a few of these, even if the list is incomplete. So instead of having strings in the notebooks, I'd put these in datasets/seed_prompts/ai_rt, maybe one file per category.
Eventually it might be nice to have a single function call that can load all our YAML seed prompts into the database so folks can use those as examples.
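A minimal sketch of what that single call might look like, assuming the `SeedDataset.from_yaml_file()` loader referenced later in this thread and PyRIT's central memory (the memory method name and import paths are assumptions, not verified against this PR):

```python
import pathlib

from pyrit.memory import CentralMemory
from pyrit.models import SeedDataset  # class name as used in this PR; import path assumed


async def load_all_seed_datasets(root: str = "pyrit/datasets/seed_prompts") -> None:
    """Hypothetical one-call loader: read every .prompt YAML under `root` and
    add its prompts to the database so folks can use them as examples."""
    memory = CentralMemory.get_memory_instance()
    for path in sorted(pathlib.Path(root).rglob("*.prompt")):
        dataset = SeedDataset.from_yaml_file(path)
        # add_seed_prompts_to_memory_async is assumed here, not verified.
        await memory.add_seed_prompts_to_memory_async(prompts=dataset.prompts, added_by="bulk_loader")
```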
I would go even further and say we should provide a truly end-to-end solution here that gives some results even if the customer doesn't bring their own datasets. Of course, the conundrum is that we may not be able to share the exact datasets we're using, but maybe it's something we should actually strive for.
Btw I'll keep fighting against the ai_rt naming for external assets 😆
I agree, but I think it's safe to require an upload, which could even be done as part of initialization. I think the dataset question can be tackled independently of this PR, although for this one it'd be nice to include some sample datasets that we can later add to the db easily.
E.g., workflow for an external user:
- memory.add_dataset(redteam) # not part of this PR
- Run the scenario
I added a .prompt file for each corresponding strategy so users will be able to run the default scenario and have an example in the content_harm_scenario.ipynb/.py! lmk if that doesn't address your concerns!
```python
    Each harm categories has a few different strategies to test different aspects of the harm type.
    """

    ALL = ("all", {"all"})
```
One idea is to only have the meta-categories. I think this may make the most sense: just have hate, fairness, violence, ..., leakage rather than each individual scenario_strategy.
I think the composition makes the code quite a bit more complicated, and I would guess most users will just want to use "all" or a subset of the categories.
In other words, I think it should look like the following (and that's it):

```python
class RapidResponseHarmStrategy(ScenarioStrategy):
    """
    RapidResponseHarmStrategy defines a set of strategies for testing model behavior
    in several different harm categories.

    Each harm category has a few different strategies to test different aspects of the harm type.
    """

    ALL = ("all", {"all"})
    HATE = ("hate", set[str]())
    FAIRNESS = ("fairness", set[str]())
    VIOLENCE = ("violence", set[str]())
    SEXUAL = ("sexual", set[str]())
    HARASSMENT = ("harassment", set[str]())
    MISINFORMATION = ("misinformation", set[str]())
    LEAKAGE = ("leakage", set[str]())
```
Alternatively, if you do want long- and short-running versions (which I also think is legit!), I might split it up like this, where the complex attacks contain the long-running methods. But my gut is that it might just be simpler to have a completely separate scenario class for those:

```python
ALL = ("all", {"all"})
HATE_QUICK = ("hate_quick", {"quick", "hate"})
HATE_EXTENDED = ("hate_extended", {"complex", "hate"})
FAIRNESS_QUICK = ("fairness_quick", {"quick", "fairness"})
...
```

Either way, I'd keep specific techniques out, as well as specific tests/datasets.
Updated to only have the meta-categories! I was considering the basic/extended options, which would selectively run fewer attacks (maybe the baseline and the single-turn attacks?), but was thinking that users would want more information for a given harm to get more of an idea of where to explore next. Frederic has pointed out that the PromptSendingAttack is likely to be unsuccessful, so we want at least one other attack to give some avenue to explore.
I like only varying one dimension on the ScenarioStrategies if we can. I think it makes it less confusing.
How I might tackle this is to have a base class:
ContentHarmScenarioBase
And then subclasses:
ContentHarmScenarioFast
ContentHarmScenarioMed
ContentHarmScenarioComplex
Each of the subclasses then sets the strategies to use (and can include others, e.g. Med can include Fast); see the sketch below.
I'm not set on this approach. We could potentially combine both "harm" and "strategy" into the class, but imo it's easier to use if we can split like this. Although one weirdness is that sometimes we'll split by harm, other times by strategy. I think that's probably less confusing for users, though?
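A rough sketch of that split, purely illustrative (the strategy names are placeholders, and the real classes would derive from PyRIT's Scenario base rather than from object):

```python
class ContentHarmScenarioBase:
    # Subclasses declare which attack strategies they run by default.
    default_strategies: set[str] = set()


class ContentHarmScenarioFast(ContentHarmScenarioBase):
    default_strategies = {"prompt_sending"}


class ContentHarmScenarioMed(ContentHarmScenarioFast):
    # "Med can include Fast": start from the fast set and extend it.
    default_strategies = ContentHarmScenarioFast.default_strategies | {"role_play", "many_shot"}


class ContentHarmScenarioComplex(ContentHarmScenarioMed):
    default_strategies = ContentHarmScenarioMed.default_strategies | {"red_teaming"}
```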
Note, I might not split them in this PR but in a followup
Overall this is good! It'll be really nice to have solid examples here :) My biggest feedback is that I think we should define exactly what we want out of this scenario. Here is what I think it is: "Can I get a vibe of this objective_target in a couple of hours based on how it does on these harm categories?" And if we keep that strategy, we want to do the best we can to answer that question, and the strategies themselves should be baked in as much as possible. Along these lines, I've left some recommendations inline.
fdubut left a comment:
Adding a few comments, mainly on structure and naming. I'll try to run my notebook shortly "as a scenario" to get a better sense of how this all works, and will share more feedback then.
```python
    model_name=""
)

# Define the helper adversarial target
```
Given the nature of the scenario, returning aggregate results on a variety of test cases, I'm wondering if we should give the option to customers to skip all test cases that require an adversarial target if they don't have one available. I think a lot of attacks that would succeed with a true adversarial target will fail with a regular model, skewing the final results.
Will follow up with another PR with a fix for the multi-turn attack that will allow us to use multi-prompt instead of red teaming, eliminating the need for the adversarial target!
```python
# %%
# Load the datasets into memory

violence_civic_data = await create_seed_dataset(
```
In the original notebook, the prompts are sequential (passed using multi-prompt attack). I haven't looked yet at the actual scenario definition but wanted to point that out.
Mentioned offline, but there's an issue with MultiPrompt that basically makes it error out, so for now I'm using the red teaming attack. For this PR, I'm going to keep it as RedTeaming, and when we work through that issue, I can update this scenario. (I like the idea of keeping this simple, and multi-prompt is a simpler multi-turn attack, so I'd prefer to use it.)
```python
*,
objective_target: PromptTarget,
scenario_strategies: Sequence[RapidResponseHarmStrategy | ScenarioCompositeStrategy] | None = None,
adversarial_chat: Optional[PromptChatTarget] = None,
```
Similar to what I mentioned in my comment on the notebook, I'm wondering if we should exclude from the scenario the attacks that require an adversarial chat when none is passed.
I think we can set a default; this is what the foundry scenario does.
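A minimal sketch of such a default, assuming `OpenAIChatTarget` picks up its endpoint and key from environment variables (the helper name is hypothetical):

```python
from typing import Optional

from pyrit.prompt_target import OpenAIChatTarget, PromptChatTarget


def _resolve_adversarial_chat(adversarial_chat: Optional[PromptChatTarget]) -> PromptChatTarget:
    # Mirror the foundry-scenario behavior described above: use the provided
    # target, otherwise fall back to an environment-configured default.
    return adversarial_chat or OpenAIChatTarget()
```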
```python
scenario_strategies (Sequence[RapidResponseHarmStrategy | ScenarioCompositeStrategy] | None):
    The harm strategies or composite strategies to include in this scenario. If None,
    defaults to RapidResponseHarmStrategy.ALL.
```
Will a user be able to compose a multi-turn scenario strategy like Crescendo? Or are we just sticking with single-turn/multi-prompt sending attacks?
Currently, the default behavior is to run PromptSending (the baseline), RolePlaying, RedTeaming, and ManyShot. I'm on the fence about having basic & extended versions (basic would maybe just run PromptSending and RedTeaming vs. extended, which would run them all); my reservation is that I don't know how much value the scenario has when the basic version is run because, as the name suggests, it's pretty basic. wdyt?
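For reference, selecting just a subset of categories might look like the following sketch (constructor details and import paths are assumptions; the class/strategy names come from this thread):

```python
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.scenarios import RapidResponseHarmScenario, RapidResponseHarmStrategy  # paths assumed

# Stand-in objective target, configured via environment variables.
target = OpenAIChatTarget()

# Run only the selected meta-categories instead of the full default set.
scenario = RapidResponseHarmScenario(
    objective_target=target,
    scenario_strategies=[
        RapidResponseHarmStrategy.HATE,
        RapidResponseHarmStrategy.VIOLENCE,
        RapidResponseHarmStrategy.HARASSMENT,
    ],
)
```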
### Example Structure

```python
class MyStrategy(ScenarioStrategy):
    ...
```
I recommend making this an ipynb. Even if it doesn't do anything, we can run this as a cell and make sure the syntax is correct; e.g., if we rename something, it will not run.
### Existing Scenarios

- **EncodingScenario**: Tests encoding attacks (Base64, ROT13, etc.) with seed prompts and decoding templates
We can actually print this using frontend.scenario_registry.list_scenarios. I have some formatting too, but that's still in another branch, and this might be good enough for this purpose.
I think it's worth doing the scenario listing programmatically so we don't have to keep the list up to date; something like the sketch below.
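A sketch based on the call named above; where `frontend` lives and what the method returns are assumptions:

```python
# List registered scenarios programmatically rather than hand-maintaining
# a markdown list. Module location and return type are assumptions.
from pyrit import frontend  # assumed home of scenario_registry

for scenario_name in frontend.scenario_registry.list_scenarios():
    print(scenario_name)
```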
| "\n", | ||
| "## What is a Scenario?\n", | ||
| "The `FoundryScenario` provides a comprehensive testing approach that includes:\n", | ||
| "- **Converter-based attacks**: Apply various encoding/obfuscation techniques (Base64, Caesar cipher, etc.)\n", |
Scenarios can take a long time to run in our integration tests. We definitely don't want a notebook for every one, but I could see this and content_harms being good examples.
I might organize it like this:
- 0_scenarios.ipynb (updated version of 0_scenarios.md)
- 1_end_to_end_scenario.ipynb (combined content harm scenario and end_to_end scenario)
- 2_composite_scenarios.ipynb (foundry scenario, with a bit extra about composite strategies)
## Sample Datasets

PyRIT provides sample datasets for each harm category in the `ContentHarmStrategy` enum, located in the `pyrit/datasets/seed_prompts/harms/` folder. Each harm category has a corresponding YAML file (e.g., `hate.prompt`, `violence.prompt`, `sexual.prompt`) containing pre-defined prompts and objectives designed to test model behavior for that specific harm type. These datasets can be loaded directly using `SeedDataset.from_yaml_file()` and serve as a starting point for content harm testing, though you can also create custom datasets to suit your specific testing needs.
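For example, loading one of the sample files described above (the class name is taken from this doc; the import path is an assumption):

```python
import pathlib

from pyrit.models import SeedDataset  # import path is an assumption

# Load the shipped hate-speech sample dataset from its documented location.
hate_dataset = SeedDataset.from_yaml_file(
    pathlib.Path("pyrit/datasets/seed_prompts/harms/hate.prompt")
)
print(f"Loaded {len(hate_dataset.prompts)} prompts")
```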
I'd sync with Victor on this workflow. I'm on the fence. His version actually only used the sample datasets. We definitely want the database option as the "default", but I did like how his worked out of the box. Is there a middle ground that is best? E.g., try to get the prompts from memory, but if they don't exist, load the sample datasets? I'm not sure. But either way, I think the two of you should brainstorm and land on one way to do it.
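One possible middle-ground sketch (the memory query API and the dataset naming used here are assumptions, for illustration only):

```python
import pathlib

from pyrit.memory import CentralMemory
from pyrit.models import SeedDataset  # import path is an assumption


def get_harm_prompts(category: str):
    """Prefer prompts already in the database; fall back to the shipped samples."""
    memory = CentralMemory.get_memory_instance()
    # get_seed_prompts(dataset_name=...) is an assumed query; adjust to the real API.
    prompts = memory.get_seed_prompts(dataset_name=f"harms_{category}")
    if prompts:
        return prompts
    sample = pathlib.Path("pyrit/datasets/seed_prompts/harms") / f"{category}.prompt"
    return SeedDataset.from_yaml_file(sample).prompts
```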
The naming schema is **critical** for these scenarios to automatically retrieve the correct datasets. The schema follows this pattern:

```
<seed_dataset_prefix>_<strategy_name>
```
I wonder if we could add a method to ScenarioStrategy that is something like `load_sample_seed_prompts(seed_prompt)`, which links the class to the sample datasets. Not something for this PR, but it would be nice for a follow-up. Then folks don't have to worry about getting the name right, and the errors you've documented can be self-explanatory as part of the class.
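Sketching the suggested helper as a standalone function (the name, signature, and error text are assumptions for a possible follow-up, not this PR's API):

```python
import pathlib

from pyrit.models import SeedDataset  # import path is an assumption


def load_sample_seed_prompts(strategy_name: str, prefix: str = "harms") -> SeedDataset:
    """Hypothetical follow-up helper: derive the sample dataset path from the
    strategy name so callers never hand-build '<seed_dataset_prefix>_<strategy_name>'."""
    path = pathlib.Path("pyrit/datasets/seed_prompts") / prefix / f"{strategy_name}.prompt"
    if not path.exists():
        # Self-explanatory error instead of the naming pitfalls documented above.
        raise FileNotFoundError(f"No sample dataset for strategy '{strategy_name}' at {path}")
    return SeedDataset.from_yaml_file(path)
```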
Would you mind following up, though, if not in this PR? E.g., creating a follow-up item if you think it's a good idea?
```python
# %% [markdown]
#
# We can also selectively choose which strategies to run. In this next example, we'll only run the Hate, Violence, and Harassment strategies.
```
I might only include one code example, maybe with a comment listing other strategies you can use. My only concern is that I don't want integration tests to take too long if we can avoid it without losing clarity.
| "ScenarioIdentifier", | ||
| "ScenarioResult", | ||
| "ContentHarmScenario", | ||
| "ContentHarmStrategy", |
I know, I know, naming is impossible. But a couple of things:
- I suspect we are going to have a ton of scenarios. I like having a prefix to import these, so people would do `from scenarios.e2e import ContentHarmScenario` instead of `from scenarios import ContentHarmScenario`. I want to follow this elsewhere also, e.g. `from scenarios.garak import EncodingScenario`.
- I don't like e2e as a name. All scenarios are, in theory, e2e. I did like airt, but I understand Frederic's point. Some suggestions I also like:
  - redteam_ops
  - redteam
Description
Add a content harm scenario which provides a general set of attacks for each harm category. The idea is to have a quick scenario that runs a comprehensive set of harms before drilling down into more specific ones. The scenario uses the PromptSending (baseline), RolePlay, ManyShot, and RedTeam attacks to provide this summary, using a set of objectives that are either user-defined or provided in the datasets/seed_prompts/harms folder.
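A sketch of the intended default usage (the import path, constructor, and run method are assumptions rather than the PR's exact API):

```python
import asyncio

from pyrit.prompt_target import OpenAIChatTarget
from pyrit.scenarios import ContentHarmScenario  # import path is an assumption


async def main() -> None:
    # Target under test; endpoint/key are read from environment variables.
    target = OpenAIChatTarget()
    scenario = ContentHarmScenario(objective_target=target)
    # Defaults to all harm categories using the PromptSending (baseline),
    # RolePlay, ManyShot, and RedTeam attacks described above.
    result = await scenario.run_async()  # method name is an assumption
    print(result)


asyncio.run(main())
```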
Tests and Documentation
- Added a content harm notebook plus instructions for dataset naming.
- Added unit tests.