[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage#1438
riyosha wants to merge 59 commits into microsoft:main
romanlutz left a comment:
This is really good! While reading it, I couldn't shake the feeling that this is very similar to RedTeamingAttack with the big difference that it cycles through the system prompt templates, of course. I haven't had time to compare with it in detail to see if that would be doable. My hunch is that it would introduce considerable complexity and is probably not worth it but I'd like to be sure...
Other things:
- needs mentioning in api.rst
- needs example notebook (both ipynb and py files) somewhere in doc/executor/attack, which in turn needs to be mentioned in TOC file. Example notebook doesn't need to be elaborate.
- needs integration test, perhaps just one that runs the example notebook. This may be auto-created by test_executor_notebooks.py I think...
    Returns:
        Optional[AttackScoringConfig]: The scoring configuration.
    """
    return AttackScoringConfig(
I'm a bit surprised that we're unpacking the attack scoring config in the constructor into these two below, and then reassembling it here. Is that a pattern you've seen in another executor?
I did see it in another executor, but it is better to forgo the unpacking and reassembly, so I removed it.
    def __init__(
        self,
        *,
        objective_target: PromptTarget = REQUIRED_VALUE,  # type: ignore[assignment]
Most likely, this assumes we're dealing with a target that has reasoning capabilities, right? @hannahwestra25 is currently working on expanding TargetCapabilities so that could come in handy here for validation purposes.
TargetCapabilities doesn't include reasoning models yet. I left a comment as a reminder to add this validation once it does!
        outcome=final_outcome,
    )

    # AttackLM
This was just the paper's terminology for the attacking model. I changed it for clarity.
    Returns:
        Optional[Score]: The score for the response.
    """
    if not self._objective_scorer:
Personally, I'm not a huge fan of this. If people want that they can explicitly specify the current fallback scorer as their objective scorer. In fact, we have a fairly good refusal scorer. If you think this one is useful we can add it as a deterministic option into the score module.
I removed the fallback scorer.
    if score_value >= 1.0:
        self._logger.info("Attack successful!")
        context.attack_succeeded = True
        final_outcome = AttackOutcome.SUCCESS
        break
    if score_value >= 0.8:
        self._logger.info("Attack largely successful (score 0.8+)")
        final_outcome = AttackOutcome.SUCCESS
        break
This should probably be configured via the scoring configuration. In other attacks, we expect a true/false scorer. So here it could be a threshold-based scorer that takes a float-scale scorer and then applies a threshold of 0.8 or 1.0 on top of that to determine success.
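The threshold idea in the comment above can be sketched roughly as follows. This is only an illustration, assuming a float-scale score in the range 0.0 to 1.0; `FloatScore`, `ThresholdedScore`, and `apply_threshold` are hypothetical names invented for this sketch, not PyRIT classes, and the default of 0.8 is simply the value from the snippet under review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatScore:
    """Hypothetical stand-in for a float-scale score in [0.0, 1.0]."""

    value: float


@dataclass(frozen=True)
class ThresholdedScore:
    """Hypothetical true/false score derived from a float-scale score."""

    value: bool
    raw_value: float
    threshold: float


def apply_threshold(score: FloatScore, threshold: float = 0.8) -> ThresholdedScore:
    """Turn a float-scale score into true/false by comparing it to a threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return ThresholdedScore(
        value=score.value >= threshold,  # success means "at or above the threshold"
        raw_value=score.value,           # keep the raw score for logging or reporting
        threshold=threshold,
    )
```

With this shape, the attack loop would only ever see a boolean outcome, and the choice between 0.8 and 1.0 would move into configuration instead of living in hard-coded branches.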
Hi, I incorporated the feedback on the PR and added test notebooks as well. I used the OpenAI chat model, and to get past the safety guardrails in the test notebooks, I used a non-harmful objective and prompt template. Also, CoTHijackingAttack supports two scorer types:
Both paths converge to the same success check: bool(score_obj.get_value()), keeping _perform_async scorer-agnostic. Let me know what you think, thanks!
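A minimal sketch of what a scorer-agnostic success check of this kind might look like. Only the expression `bool(score_obj.get_value())` comes from the comment above. The two score classes, the `ScoreLike` protocol, and `first_successful_turn` are hypothetical stand-ins written for this illustration, and they are not necessarily the two scorer types the author refers to.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol, Union


class ScoreLike(Protocol):
    """Anything exposing get_value(); hypothetical interface for this sketch."""

    def get_value(self) -> Union[bool, float]: ...


@dataclass
class TrueFalseScore:
    """Hypothetical score that is already a boolean."""

    passed: bool

    def get_value(self) -> bool:
        return self.passed


@dataclass
class ThresholdedFloatScore:
    """Hypothetical float score that reports true/false via a threshold."""

    raw: float
    threshold: float = 0.8

    def get_value(self) -> bool:
        return self.raw >= self.threshold


def first_successful_turn(scores: Iterable[ScoreLike]) -> Optional[int]:
    """Return the 1-based turn of the first successful score, or None."""
    for turn, score_obj in enumerate(scores, start=1):
        # The loop never inspects the score type, only its truthiness.
        if bool(score_obj.get_value()):
            return turn
    return None
```

Because the loop depends only on `get_value()`, swapping one scorer type for the other would not require touching the attack logic.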
Description
Related to issue #897
This PR introduces the Chain-of-Thought (CoT) Hijacking attack strategy, as described in Zhao et al. (2025). The changes include:
- pyrit/executor/attack/multi_turn/cot_hijacking.py
- pyrit/datasets/executors/cot_hijacking/puzzle_generation_{puzzle_type}.yaml
- tests/unit/executor/attack/multi_turn/test_cot_hijacking.py
Related issues: #897
Tests and Documentation
- tests/unit/executor/attack/multi_turn/test_cot_hijacking.py
This is a draft PR and I want to get your thoughts on the implementation so far. I have planned these updates:
Question:
async def _teardown_async even if unused. Should I also add it?