[DRAFT] FEAT: Chain-of-Thought Hijacking Attack Strategy and Test Coverage #1438

Open

riyosha wants to merge 59 commits into microsoft:main

Contributor

riyosha commented Mar 5, 2026

Description

Related to issue #897

This PR introduces the Chain-of-Thought (CoT) Hijacking attack strategy, as described in Zhao et al. (2025). The changes include:

ADDED: Implementation of the CoT Hijacking attack strategy - pyrit/executor/attack/multi_turn/cot_hijacking.py
ADDED: YAML prompt templates for 6 puzzle types from the paper - pyrit/datasets/executors/cot_hijacking/puzzle_generation_{puzzle_type}.yaml
ADDED: Unit tests for CoT Hijacking attack - tests/unit/executor/attack/multi_turn/test_cot_hijacking.py
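
As a rough illustration of how the per-puzzle-type template files listed above could be resolved, here is a minimal sketch. The directory and file-name pattern come from the description; the puzzle type names, the helper names, and the one-template-per-iteration cycling are assumptions made for the example, not the PR's actual code.

```python
from pathlib import Path

# Directory and file-name pattern are taken from the PR description.
TEMPLATE_DIR = Path("pyrit/datasets/executors/cot_hijacking")

# Placeholder names: the real puzzle types from the paper are not listed here.
PUZZLE_TYPES = ["type_a", "type_b", "type_c"]


def template_path(puzzle_type: str) -> Path:
    """Return the YAML template path for a single puzzle type."""
    return TEMPLATE_DIR / f"puzzle_generation_{puzzle_type}.yaml"


def template_for_iteration(iteration: int) -> Path:
    """Pick a template for a 0-based iteration, wrapping around the list.

    Assumption: the attack uses one puzzle type per iteration.
    """
    return template_path(PUZZLE_TYPES[iteration % len(PUZZLE_TYPES)])
```

With three placeholder types, iteration 3 wraps around to the first template again.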

Tests and Documentation

Added unit tests in tests/unit/executor/attack/multi_turn/test_cot_hijacking.py
Tested the attack locally with llama3:8b as the target model and mistral:7b as the attacker model. (Neither model has advanced reasoning capabilities; suggestions for better, locally runnable large reasoning models (LRMs) are welcome!)

This is a draft PR and I want to get your thoughts on the implementation so far. I have planned these updates:

Currently I'm relying on the _fallback_score_response function, which uses pattern matching to generate a score. I want to replace this with either an auxiliary model acting as a scorer or float-scale scoring via the Azure Content Safety API.
Currently, the iterative feedback given to the attacker model (_generate_attack_prompt_async) only includes the harm score and a static prompt asking it to make the puzzle more complex. I'll update it to include the target's previous safe response as well.
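
To make the second planned update concrete, the feedback message could be assembled roughly as follows. This is only a sketch: `build_feedback_prompt`, its parameters, and the wording of the prompt are hypothetical and do not mirror `_generate_attack_prompt_async`.

```python
from typing import Optional


def build_feedback_prompt(
    objective: str,
    harm_score: float,
    previous_response: Optional[str] = None,
    max_response_chars: int = 500,
) -> str:
    """Assemble feedback for the attacker model (hypothetical helper)."""
    parts = [
        f"Objective: {objective}",
        f"Previous harm score: {harm_score:.2f}",
        "Make the puzzle more complex.",
    ]
    if previous_response:
        # Truncate so a long refusal does not crowd out the instructions.
        excerpt = previous_response.strip()[:max_response_chars]
        parts.append(f"Target's previous response:\n{excerpt}")
    return "\n\n".join(parts)
```

Passing the previous response as an optional argument keeps the first iteration, where no response exists yet, on the same code path.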

Question:

I noticed a few other multi-turn attack strategies define async def _teardown_async even if it is unused. Should I add it as well?


          added first iteration of the cot hijacking attack strategy

romanlutz reviewed

Contributor

romanlutz left a comment

This is really good! While reading it, I couldn't shake the feeling that this is very similar to RedTeamingAttack with the big difference that it cycles through the system prompt templates, of course. I haven't had time to compare with it in detail to see if that would be doable. My hunch is that it would introduce considerable complexity and is probably not worth it but I'd like to be sure...

Other things:

needs mentioning in api.rst
needs example notebook (both ipynb and py files) somewhere in doc/executor/attack, which in turn needs to be mentioned in TOC file. Example notebook doesn't need to be elaborate.
needs integration test, perhaps just one that runs the example notebook. This may be auto-created by test_executor_notebooks.py I think...

pyrit/executor/attack/multi_turn/cot_hijacking.py Outdated

+                      Returns:
+                          Optional[AttackScoringConfig]: The scoring configuration.
+                      """
+                      return AttackScoringConfig(

Contributor

romanlutz Mar 6, 2026

I'm a bit surprised that we're unpacking the attack scoring config in the constructor into these two below, and then reassembling it here. Is that a pattern you've seen in another executor?

Contributor Author

riyosha Mar 24, 2026

I did see it in some other executor, but it is better to forgo the unpacking and reassembly, so I removed it.

pyrit/executor/attack/multi_turn/cot_hijacking.py

+                  def __init__(
+                      self,
+                      *,
+                      objective_target: PromptTarget = REQUIRED_VALUE,  # type: ignore[assignment]

Contributor

romanlutz Mar 6, 2026

Most likely, this assumes we're dealing with a target that has reasoning capabilities, right? @hannahwestra25 is currently working on expanding TargetCapabilities so that could come in handy here for validation purposes.

Contributor Author

riyosha Mar 24, 2026

TargetCapabilities doesn't include reasoning models yet. I left a commented-out note to add this validation once it is included!
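
For reference, the kind of validation discussed in this thread might look like the sketch below once a reasoning flag exists. `HypotheticalCapabilities` and its `supports_reasoning` field are stand-ins invented for the example; they are not PyRIT's TargetCapabilities API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypotheticalCapabilities:
    """Stand-in capabilities record; not PyRIT's TargetCapabilities."""

    supports_multi_turn: bool = True
    supports_reasoning: bool = False  # assumed flag, does not exist yet


def validate_reasoning_target(capabilities: HypotheticalCapabilities) -> None:
    """Fail fast if the target is not flagged as a reasoning model."""
    if not capabilities.supports_reasoning:
        raise ValueError(
            "CoT hijacking expects a target with reasoning capabilities."
        )
```

Raising in the constructor or setup phase would surface a misconfigured target before any prompts are sent.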

pyrit/executor/attack/multi_turn/cot_hijacking.py Outdated

+                          outcome=final_outcome,
+                      )
+                  # AttackLM

Contributor

romanlutz Mar 6, 2026

What does this mean?

Contributor Author

riyosha Mar 24, 2026

This was just terminology from the paper for the attacking model. I changed it for clarity

pyrit/executor/attack/multi_turn/cot_hijacking.py Outdated

+                      Returns:
+                          Optional[Score]: The score for the response.
+                      """
+                      if not self._objective_scorer:

Contributor

romanlutz Mar 6, 2026

Personally, I'm not a huge fan of this. If people want that they can explicitly specify the current fallback scorer as their objective scorer. In fact, we have a fairly good refusal scorer. If you think this one is useful we can add it as a deterministic option into the score module.

Contributor Author

riyosha Mar 24, 2026

I removed the fallback scorer
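
If a deterministic, pattern-based option were added to the score module as suggested above, its core could be as small as the following sketch. The phrase list and the `looks_like_refusal` name are illustrative assumptions; this is not the removed `_fallback_score_response` code and not PyRIT's refusal scorer.

```python
import re

# Illustrative refusal phrases only; a real list would need tuning.
REFUSAL_PATTERNS = [
    r"\bI can(?:no|')t\b",             # "I cannot", "I can't"
    r"\bI(?: am|'m) (?:sorry|unable)\b",
    r"\bI won't\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def looks_like_refusal(response: str) -> bool:
    """Return True if the response contains a known refusal phrase."""
    return bool(_REFUSAL_RE.search(response))
```

A heuristic like this is cheap and reproducible, but it misses paraphrased refusals, which is one reason to prefer an explicit objective scorer.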

pyrit/executor/attack/multi_turn/cot_hijacking.py Outdated

Comment on lines +241 to +249

+                          if score_value >= 1.0:
+                              self._logger.info("Attack successful!")
+                              context.attack_succeeded = True
+                              final_outcome = AttackOutcome.SUCCESS
+                              break
+                          if score_value >= 0.8:
+                              self._logger.info("Attack largely successful (score 0.8+)")
+                              final_outcome = AttackOutcome.SUCCESS
+                              break

Contributor

romanlutz Mar 6, 2026

This should probably be configured via the scoring configuration. In other attacks, we expect a true/false scorer. So here it could be a threshold-based one that takes a float scale scorer and then applies a threshold of 0.8 or 1.0 on top of that to determine success.

Contributor Author

riyosha Mar 24, 2026

Did this!
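
A minimal sketch of the idea in this thread: replace the two hard-coded branches with a single configurable threshold. `ThresholdConfig` and `is_success` are hypothetical names used for illustration; the 0.8 default is taken from the snippet under review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdConfig:
    """Hypothetical holder for the success threshold."""

    success_threshold: float = 0.8

    def __post_init__(self) -> None:
        # Float-scale scores are assumed to lie in [0, 1].
        if not 0.0 <= self.success_threshold <= 1.0:
            raise ValueError("success_threshold must be in [0, 1]")


def is_success(score_value: float, config: ThresholdConfig) -> bool:
    """Convert a float-scale score into a true/false outcome."""
    return score_value >= config.success_threshold
```

With this shape, the stricter 1.0 behaviour is just `ThresholdConfig(success_threshold=1.0)` rather than a separate branch in the attack loop.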

romanlutz and others added 25 commits

March 24, 2026 00:56


          FEAT Add HarmfulQA dataset loader (microsoft#1421)

7a0e340


          FEAT: Scientific Translation Converter (microsoft#1379)

5179e46


          MAINT: Add permissions to docker_build workflow (microsoft#1441)

06d8432


          MAINT: Bump pip deps (microsoft#1442)

c18ea4d


          TEST: add unit tests for ConverterRegistry (microsoft#1440)

80dc9ef


          FEAT: Flexible Scale Likert Scoring (microsoft#1444)

c1a9def


          FEAT Backend attack API: conversation-centric redesign with multi-conversation workspaces and media serving (microsoft#1419)

2e562b5

Co-authored-by: Roman Lutz <romanlutz@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          MAINT Updating Release Instructions (microsoft#1449)

05963c4

Co-authored-by: Victor Valbuena <vvalbuena@microsoft.com>
Co-authored-by: Spencer Schoenberg <23708360+spencrr@users.noreply.github.com>


          FEAT: atomic attack identifier (microsoft#1446)

1c480a9


          FEAT: Update evaluate_scorers (microsoft#1406)

361f693

Co-authored-by: Richard Lundeen <rlundeen@microsoft.com>


          FIX: Reorder scorer metrics notebook in table of contents (microsoft#1452)


          FIX: Fixing SQL Azure Integration Tests (microsoft#1457)

d5284cb


          MAINT: Adding Scorer Evals (microsoft#1455)

418169f


          MAINT Fix integration test import errors and runtime issues (microsoft#1448)

75f8e2f


          DOC: Add Release Readiness step to release process docs (microsoft#1450)

060b17a

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Victor Valbuena <50061128+ValbuenaVC@users.noreply.github.com>


          FIX use cognitiveservices scope for all Azure AI endpoints (microsoft#1453)

9a7461f

Co-authored-by: Roman Lutz <romanlutz@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          FEAT Wire frontend attack view to backend APIs (microsoft#1371)

aa5623d


          Fix type annotation warnings and test warnings (issue microsoft#442) (microsoft#1459)

15e7149

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          FIX address dependabot alerts by bumping package versions (microsoft#1460)

df6b9b5

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          FIX: Adding openai invalid_prompt safety blocks as content filters (microsoft#1463)

41f0aa0


          FEAT Animated ASCII banner with raccoon mascot for PyRIT CLI (microsoft#1417)

b04f1ad


          FEAT: CBT-Bench Dataset (microsoft#1411)

Co-authored-by: Waris <warisgill@microsoft.com>
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          DOC Upgrade to jupyterbook v2 and add proper landing page (microsoft#1458)

819c49c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          DOC GitHub Pages 404: use static HTML output for deployment (microsoft#1465)

bbcfe9e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          DOC fix pages deploy (microsoft#1466)

2d65fe4

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

rlundeen2 and others added 29 commits

March 24, 2026 00:56


          FEAT: Adding PyRITInitializer parameters (microsoft#1456)

3909aee

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          DOC: Add bibliography support with BibTeX citations across documentation (microsoft#1472)

a0ae027

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          FEAT AzureContentFilterScorer: Switch to async client and accept async auth providers (microsoft#1467)

66ab015

Co-authored-by: Adrian Gavrila <agavrila@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          Preserve URL case in HTTP target requests (microsoft#1484)

97056a8


          FEAT: Capture token usage from ChatCompletion response in OpenAIChatTarget (microsoft#1476)

0ff9ef3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: rlundeen2 <137218279+rlundeen2@users.noreply.github.com>


          DOC: updating copilot review instructions (microsoft#1477)

cf7b9dd


          MAINT: Removing pydub as a dependency (microsoft#1445)

96357b0


          Support CRLF raw HTTP requests in HTTPTarget (microsoft#1491)

e421751


          [BUG] Fix JSON path for converter class names in attack result queries (microsoft#1512)

37bcdaa

Co-authored-by: Bolor <bjagdagdorj@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          FIX GUI promote conversation to main feature working (microsoft#1513)

707d93a

Co-authored-by: Adrian Gavrila <agavrila@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          Preserve empty JSON schema metadata (microsoft#1488)

9173b62


          Ignore blank lines when reading TXT prompts (microsoft#1480)

8e90556


          Ignore blank lines when reading JSONL (Azure#1479)

21664dd


          FIX GUI conversation switching during in-flight requests and sort ordering (microsoft#1517)

756280a

Co-authored-by: Adrian Gavrila <agavrila@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          Handle zero tail slices in SeedDataset.get_values (microsoft#1511)

3a9d8bc


          FIX Preserve silent when loading config overrides (microsoft#1500)

df1ad7a


          FIX Reject empty WMDP category values (microsoft#1497)

b77b367

Co-authored-by: Roman Lutz <romanlutz13@gmail.com>


          FEAT expand TargetCapabilities (microsoft#1464)

21f7a9f


          FIX: PyRITShell startup deadlock and improve shell startup time (microsoft#1489)

9f51e0c

Co-authored-by: Richard Lundeen <rlundeen@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          FEAT: Dataset Loading Changes (microsoft#1451)

640a8ba

Co-authored-by: Victor Valbuena <vvalbuena@microsoft.com>
Co-authored-by: hannahwestra25 <hannahwestra@microsoft.com>


          FEAT Breaking: Adding tags to registry classes (microsoft#1485)

420c91c

Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
Co-authored-by: hannahwestra25 <hannahwestra@microsoft.com>


          FIX align platform oai key (microsoft#1522)

2d0def7


          FIX missing custom capabilities in integration test (microsoft#1521)

7749d7c


          FIX: Small fixes to CLI docs and openai_objective_target initializer (microsoft#1524)

fc56019


          Preserve request params and validate upload files in HTTPXAPITarget (microsoft#1487)

7a68447


          Ignore imported initializer classes in script discovery (microsoft#1509)

38273e7


          Fix: Eval hash mismatch due to parameter truncation in DB storage (microsoft#1523)

9465aba

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


          MAINT: Optimize devcontainer Dockerfile (microsoft#1437)

157ebd6


          incorporated PR feedback

Contributor Author

riyosha commented Mar 24, 2026

Hi, I incorporated the feedback on the PR and added test notebooks as well. I used the OpenAI chat model, and to avoid triggering its safety guardrails in the test notebooks, I used a non-harmful objective and prompt template.

Also, CoTHijackingAttack supports two scorer types:

TrueFalseScorer - passed via attack_scoring_config. Returns a boolean directly, used as-is for success determination.
FloatScaleScorer - passed via float_scale_scorer. Wrapped internally in FloatScaleThresholdScorer with the configurable success_threshold parameter, converting the float score to a boolean.

Both paths converge to the same success check: bool(score_obj.get_value()), keeping _perform_async scorer-agnostic.
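
The convergence described above can be pictured with plain callables. This is a simplified sketch, not the PR's implementation: `resolve_scorer` is a hypothetical helper, the scorers are modelled as functions rather than PyRIT scorer classes, and requiring exactly one of the two arguments is an assumption.

```python
from typing import Callable, Optional

BoolScorer = Callable[[str], bool]
FloatScorer = Callable[[str], float]


def resolve_scorer(
    *,
    true_false_scorer: Optional[BoolScorer] = None,
    float_scale_scorer: Optional[FloatScorer] = None,
    success_threshold: float = 0.8,
) -> BoolScorer:
    """Return a boolean scorer regardless of which scorer type was given."""
    # Assumption: exactly one of the two scorers must be provided.
    if (true_false_scorer is None) == (float_scale_scorer is None):
        raise ValueError("Provide exactly one scorer.")
    if true_false_scorer is not None:
        return true_false_scorer  # used as-is

    def thresholded(response: str) -> bool:
        # Float path: apply the threshold to obtain a boolean.
        return float_scale_scorer(response) >= success_threshold

    return thresholded
```

Because both branches return the same callable type, the attack loop only ever sees a boolean and needs no scorer-specific logic.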

Let me know what you think, thanks!
