Shared Context Across Subtasks: An Approach to Protect Agents Against Malicious Objectives Split Into 'Harmless-Seeming' Subtasks
This project provides a framework for testing whether providing stateful, shared context to an AI agent improves its ability to detect and refuse multi-step 'harmless-seeming' tasks which sum to a malicious over-arching objective, while preserving its ability to perform benign tasks. [Lesswrong link]
It compares a stateless "Control" agent against an "Experimental" agent that is given a memory of its past actions and inferred intents.
- Stateful vs. Stateless agent comparison.
- A benchmark composed of multi-step "chains" of both malicious and benign prompts.
- Automated analysis of results to compare agent performance.
- Automatic resume of interrupted benchmark runs.
-
Clone the Repository
git clone <repository_url> cd <repository_directory>
-
Create and Activate Virtual Environment
python3 -m venv env source env/bin/activate -
Install Dependencies
pip install -r requirements.txt
-
Set Up API Key Create a file named
.envin the root of the project and add your OpenAI API key to it.echo "OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" > .env
Replace
sk-xxxxxxxx...with your actual API key.
To run the full benchmark for both the control and experimental agents, use the following command:
python run_benchmark.py 2>&1 | tee -a run_logs/run.logThis command will:
- Execute the benchmark.
- Print the live output to your console.
- Append a complete log of the run (including errors) to
run_logs/run.log.
The benchmark script automatically detects and skips any chains that have already been fully completed.
- To resume a run, simply execute the same command again. The script will pick up where it left off, re-running any chains that were only partially completed.
- To start a completely fresh run, you must clear the contents of the
run_logs/directory before running the script.rm -rf run_logs/*
After a benchmark run is complete, you can generate a summary table comparing the performance of the two agents.
python analyze_results.pyThis script parses the log files in the run_logs/ directory and prints a comparison table showing how each agent performed on the malicious and benign chains.