-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Add tutorial for dspy-trusted-monitor using GEPA #8938
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ZachParent
commented
Oct 14, 2025
- create a tutorial based on dspy-trusted-monitor
- load the dataset from control-arena
- define a suspicion classifier
- define a trainer program
- define a comparative training metric
- prompt optimize using GEPA
- finetune
- evaluate
- use control-arena
- define an inspect-ai - dspy bridge LM
- define a control_agent to wrap the monitor dspy program
- run an eval set with baselines
- analyze
- plot the results with plotly
- show suspicion score distribution
- show box plot of safety vs audit budget, by protocol
|
This is amazing, thank you @ZachParent !! Let's get this merged but it has no links right now from the yaml or the markdown indexes for tutorials... so it wouldn't be discoverable as-is. |
d48df84 to
42bb9f7
Compare
|
good call, thanks Omar! let me know if I missed any pages where I should add the link! |
| This tutorial explores how GEPA can improve rapidly in as few as 1 iteration, while leveraging a simple feedback provided by a LLM-as-a-judge metric. The tutorial also explores how GEPA benefits from the textual feedback showing a breakdown of aggregate metrics into sub-components, allowing the reflection LM to identify what aspects of the task need improvement. | ||
|
|
||
| ### [GEPA for Code Backdoor Classification (AI control)](../gepa_trusted_monitor/index.ipynb) | ||
| This tutorial explores how GEPA can optimize a GPT-4.1 Nano classifier to identify backdoors in code written by a larger LM, using `dspy.GEPA` and a comparative metric! The comparative metric allows the prompt optimizer to create a prompt that identifies the signals in the code that are indicative of a backdoor, teasing apart positive samples from negative samples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be good to reference and link the LessWrong blogpost to provide context and connection: https://www.lesswrong.com/posts/bALBxf3yGGx4bvvem/prompt-optimization-can-enable-ai-control-research
| This tutorial explores how GEPA can improve rapidly in as few as 1 iteration, while leveraging a simple feedback provided by a LLM-as-a-judge metric. The tutorial also explores how GEPA benefits from the textual feedback showing a breakdown of aggregate metrics into sub-components, allowing the reflection LM to identify what aspects of the task need improvement. | ||
| This tutorial explores how GEPA can improve rapidly in as few as 1 iteration, while leveraging a simple feedback provided by a LLM-as-a-judge metric. The tutorial also explores how GEPA benefits from the textual feedback showing a breakdown of aggregate metrics into sub-components, allowing the reflection LM to identify what aspects of the task need improvement. | ||
|
|
||
| ### [GEPA for Code Backdoor Classification (AI control)](../gepa_trusted_monitor/index.ipynb) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be better to call this "Code Backdoor Detection"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to go either way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tutorial is incredibly detailed and useful. Thank you so much!
Could you please look into the following:
- Can you provide the outputs of all the cells? Also, it would be good to compare the baseline (unoptimized) and optimized program's performance. Could be even better to show one sample run with both baseline and optimized.
- I am wondering if the setup after GEPA optimization is necessary for the tutorial. DId you have any thoughts about how this part could be simplified?
Other than these, the tutorial looks amazing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is just a python export of the ipynb? In that case, I think we can remove this as it will not be rendered in the documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this file is causing lint errors too, let's remove this file