Shared Context Across Subtasks: An Approach to Protect Agents Against Malicious Objectives Split Into 'Harmless-Seeming' Subtasks

This project provides a framework for testing whether providing stateful, shared context to an AI agent improves its ability to detect and refuse multi-step 'harmless-seeming' tasks which sum to a malicious over-arching objective, while preserving its ability to perform benign tasks. [Lesswrong link]

It compares a stateless "Control" agent against an "Experimental" agent that is given a memory of its past actions and inferred intents.

Features

Stateful vs. Stateless agent comparison.
A benchmark composed of multi-step "chains" of both malicious and benign prompts.
Automated analysis of results to compare agent performance.
Automatic resume of interrupted benchmark runs.

Setup Instructions

Clone the Repository

git clone <repository_url>
cd <repository_directory>

Create and Activate Virtual Environment

python3 -m venv env
source env/bin/activate

Install Dependencies
```
pip install -r requirements.txt
```
Set Up API Key Create a file named .env in the root of the project and add your OpenAI API key to it.
```
echo "OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" > .env
```
Replace sk-xxxxxxxx... with your actual API key.

Running the Benchmark

To run the full benchmark for both the control and experimental agents, use the following command:

python run_benchmark.py 2>&1 | tee -a run_logs/run.log

This command will:

Execute the benchmark.
Print the live output to your console.
Append a complete log of the run (including errors) to run_logs/run.log.

Resuming an Interrupted Run

The benchmark script automatically detects and skips any chains that have already been fully completed.

To resume a run, simply execute the same command again. The script will pick up where it left off, re-running any chains that were only partially completed.

Starting a Fresh Run

To start a completely fresh run, you must clear the contents of the run_logs/ directory before running the script.
```
rm -rf run_logs/*
```

Analyzing Results

After a benchmark run is complete, you can generate a summary table comparing the performance of the two agents.

python analyze_results.py

This script parses the log files in the run_logs/ directory and prints a comparison table showing how each agent performed on the malicious and benign chains.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Guiding Docs		Guiding Docs
PurpleLlama_repo		PurpleLlama_repo
run_logs		run_logs
wrt_hiding_context		wrt_hiding_context
.gitignore		.gitignore
README.md		README.md
agent.py		agent.py
analyze_results.py		analyze_results.py
prepare_benchmark_data.py		prepare_benchmark_data.py
prepared_benchmark.json		prepared_benchmark.json
requirements.txt		requirements.txt
run_benchmark.py		run_benchmark.py
safe_model.py		safe_model.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Shared Context Across Subtasks: An Approach to Protect Agents Against Malicious Objectives Split Into 'Harmless-Seeming' Subtasks

Features

Setup Instructions

Running the Benchmark

Resuming an Interrupted Run

Starting a Fresh Run

Analyzing Results

About

Uh oh!

Releases

Packages

Languages

mayowaosibodu/shared-context

Folders and files

Latest commit

History

Repository files navigation

Shared Context Across Subtasks: An Approach to Protect Agents Against Malicious Objectives Split Into 'Harmless-Seeming' Subtasks

Features

Setup Instructions

Running the Benchmark

Resuming an Interrupted Run

Starting a Fresh Run

Analyzing Results

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages