
Conversation

@ZachParent
Contributor

  • create a tutorial based on dspy-trusted-monitor
    • load the dataset from control-arena
    • define a suspicion classifier
    • define a trainer program
    • define a comparative training metric
    • prompt optimize using GEPA
    • finetune
    • evaluate
      • use control-arena
      • define an inspect-ai - dspy bridge LM
      • define a control_agent to wrap the monitor dspy program
      • run an eval set with baselines
    • analyze
      • plot the results with plotly
      • show suspicion score distribution
      • show box plot of safety vs audit budget, by protocol
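The "comparative training metric" step in the outline above can be sketched as a plain Python function. This is an illustrative assumption, not the tutorial's actual implementation: the name `comparative_metric` and the 1 / 0.5 / 0 scoring are hypothetical. The idea is to reward the monitor whenever it assigns a higher suspicion score to the backdoored (attack) sample than to the honest one.

```python
def comparative_metric(honest_score: float, attack_score: float,
                       margin: float = 0.0) -> float:
    """Score a (honest, attack) pair of monitor suspicion outputs.

    Hypothetical sketch: full credit when the attack sample is ranked
    strictly more suspicious than the honest one (by at least `margin`),
    half credit for an exact tie, zero otherwise.
    """
    if attack_score > honest_score + margin:
        return 1.0
    if attack_score == honest_score:
        return 0.5
    return 0.0
```

In a DSPy setup, a function like this would be wrapped as the metric callable over paired examples before being handed to the GEPA optimizer.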

@okhat
Collaborator

okhat commented Oct 18, 2025

This is amazing, thank you @ZachParent !!

Let's get this merged but it has no links right now from the yaml or the markdown indexes for tutorials... so it wouldn't be discoverable as-is.

@ZachParent
Contributor Author

Good call, thanks Omar! Let me know if I missed any pages where I should add the link!

This tutorial explores how GEPA can improve rapidly, in as few as one iteration, by leveraging simple feedback provided by an LLM-as-a-judge metric. The tutorial also shows how GEPA benefits from textual feedback that breaks aggregate metrics down into sub-components, allowing the reflection LM to identify which aspects of the task need improvement.
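One way to picture the sub-component breakdown described above is a judge that returns both an aggregate score and textual feedback flagging weak components for the reflection LM. This is a hedged sketch: the function name, the mean aggregation, and the flagging threshold are assumptions for illustration, not the tutorial's code.

```python
def judge_feedback(subscores: dict[str, float],
                   threshold: float = 0.5) -> tuple[float, str]:
    """Aggregate per-component judge scores and emit textual feedback.

    Hypothetical sketch: the aggregate is a simple mean, and components
    scoring below `threshold` are flagged so a reflection LM can see
    which aspects of the task need improvement.
    """
    score = sum(subscores.values()) / len(subscores)
    lines = []
    for name, value in sorted(subscores.items()):
        flag = "  <- needs improvement" if value < threshold else ""
        lines.append(f"- {name}: {value:.2f}{flag}")
    feedback = f"Aggregate score: {score:.2f}\n" + "\n".join(lines)
    return score, feedback
```

The returned `(score, feedback)` pair mirrors the general shape GEPA expects from a feedback-providing metric: a scalar for selection plus text for reflection.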

### [GEPA for Code Backdoor Classification (AI control)](../gepa_trusted_monitor/index.ipynb)
This tutorial explores how GEPA can optimize a GPT-4.1 Nano classifier to identify backdoors in code written by a larger LM, using `dspy.GEPA` and a comparative metric! The comparative metric lets the prompt optimizer craft a prompt that identifies the signals in the code indicative of a backdoor, teasing apart positive samples from negative samples.
Collaborator

It could be good to reference and link the LessWrong blogpost to provide context and connection: https://www.lesswrong.com/posts/bALBxf3yGGx4bvvem/prompt-optimization-can-enable-ai-control-research

Collaborator

It may be better to call this "Code Backdoor Detection"?

Collaborator

Happy to go either way

Collaborator

The tutorial is incredibly detailed and useful. Thank you so much!

Could you please look into the following:

  1. Can you provide the outputs of all the cells? Also, it would be good to compare the baseline (unoptimized) and optimized programs' performance. Even better would be to show one sample run with both the baseline and the optimized program.
  2. I am wondering if the setup after GEPA optimization is necessary for the tutorial. Did you have any thoughts about how this part could be simplified?

Other than these, the tutorial looks amazing!

Collaborator

I think this is just a python export of the ipynb? In that case, I think we can remove this as it will not be rendered in the documentation?

Collaborator

Yes, this file is causing lint errors too, let's remove this file

@okhat okhat merged commit 8a7bcfd into stanfordnlp:main Oct 23, 2025
12 checks passed