
Conversation

@ZachParent
Contributor

  • create a tutorial based on dspy-trusted-monitor
    • load the dataset from control-arena
    • define a suspicion classifier
    • define a trainer program
    • define a comparative training metric
    • prompt optimize using GEPA
    • finetune
    • evaluate
      • use control-arena
      • define an inspect-ai - dspy bridge LM
      • define a control_agent to wrap the monitor dspy program
      • run an eval set with baselines
    • analyze
      • plot the results with plotly
      • show suspicion score distribution
      • show box plot of safety vs audit budget, by protocol
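The "comparative training metric" step in the outline above can be sketched as a plain Python function. This is an illustrative assumption, not the tutorial's actual implementation: the name `comparative_metric` and the 1 / 0.5 / 0 scoring are hypothetical. The idea is to reward the monitor whenever it assigns a higher suspicion score to the backdoored (attack) sample than to the honest one.

```python
def comparative_metric(honest_score: float, attack_score: float,
                       margin: float = 0.0) -> float:
    """Score a (honest, attack) pair of monitor suspicion outputs.

    Hypothetical sketch: full credit when the attack sample is ranked
    strictly more suspicious than the honest one (by at least `margin`),
    half credit for an exact tie, zero otherwise.
    """
    if attack_score > honest_score + margin:
        return 1.0
    if attack_score == honest_score:
        return 0.5
    return 0.0
```

In a DSPy setup, a function like this would be wrapped as the metric callable over paired examples before being handed to the GEPA optimizer.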

@okhat
Collaborator

okhat commented Oct 18, 2025

This is amazing, thank you @ZachParent !!

Let's get this merged but it has no links right now from the yaml or the markdown indexes for tutorials... so it wouldn't be discoverable as-is.

@ZachParent
Contributor Author

Good call, thanks Omar! Let me know if I missed any pages where I should add the link!

This tutorial explores how GEPA can improve rapidly, in as few as one iteration, by leveraging simple feedback provided by an LLM-as-a-judge metric. The tutorial also shows how GEPA benefits from textual feedback that breaks aggregate metrics down into sub-components, allowing the reflection LM to identify which aspects of the task need improvement.
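One way to picture the sub-component breakdown described above is a judge that returns both an aggregate score and textual feedback flagging weak components for the reflection LM. This is a hedged sketch: the function name, the mean aggregation, and the flagging threshold are assumptions for illustration, not the tutorial's code.

```python
def judge_feedback(subscores: dict[str, float],
                   threshold: float = 0.5) -> tuple[float, str]:
    """Aggregate per-component judge scores and emit textual feedback.

    Hypothetical sketch: the aggregate is a simple mean, and components
    scoring below `threshold` are flagged so a reflection LM can see
    which aspects of the task need improvement.
    """
    score = sum(subscores.values()) / len(subscores)
    lines = []
    for name, value in sorted(subscores.items()):
        flag = "  <- needs improvement" if value < threshold else ""
        lines.append(f"- {name}: {value:.2f}{flag}")
    feedback = f"Aggregate score: {score:.2f}\n" + "\n".join(lines)
    return score, feedback
```

The returned `(score, feedback)` pair mirrors the general shape GEPA expects from a feedback-providing metric: a scalar for selection plus text for reflection.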

### [GEPA for Code Backdoor Classification (AI control)](../gepa_trusted_monitor/index.ipynb)
This tutorial explores how GEPA can optimize a GPT-4.1 Nano classifier to identify backdoors in code written by a larger LM, using `dspy.GEPA` and a comparative metric! The comparative metric lets the prompt optimizer craft a prompt that identifies the signals in the code indicative of a backdoor, teasing apart positive samples from negative samples.
Collaborator

It could be good to reference and link the LessWrong blogpost to provide context and connection: https://www.lesswrong.com/posts/bALBxf3yGGx4bvvem/prompt-optimization-can-enable-ai-control-research

Collaborator

It may be better to call this "Code Backdoor Detection"?

Collaborator

Happy to go either way

Collaborator

The tutorial is incredibly detailed and useful. Thank you so much!

Could you please look into the following:

  1. Can you provide the outputs of all the cells? Also, it would be good to compare the baseline (unoptimized) and optimized programs' performance. Even better would be to show one sample run with both the baseline and the optimized program.
  2. I am wondering if the setup after GEPA optimization is necessary for the tutorial. Did you have any thoughts about how this part could be simplified?

Other than these, the tutorial looks amazing!

Collaborator

I think this is just a python export of the ipynb? In that case, I think we can remove this as it will not be rendered in the documentation?

Collaborator

Yes, this file is causing lint errors too, let's remove this file

@okhat okhat merged commit 8a7bcfd into stanfordnlp:main Oct 23, 2025
12 checks passed