Shared Context Across Subtasks: An Approach to Protect Agents Against Malicious Objectives Split Into 'Harmless-Seeming' Subtasks

This project provides a framework for testing whether giving an AI agent stateful, shared context improves its ability to detect and refuse multi-step 'harmless-seeming' tasks that add up to a malicious overarching objective, while preserving its ability to perform benign tasks. [Lesswrong link]

It compares a stateless "Control" agent against an "Experimental" agent that is given a memory of its past actions and inferred intents.
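
For illustration only, here is a minimal sketch of the kind of shared-context record the experimental agent might carry between subtasks; the field names and structure are assumptions for this example, not the project's actual schema.

    # Hypothetical shared-context memory (assumed structure, for illustration).
    # The experimental agent sees this history alongside each new subtask;
    # the control agent sees each subtask in isolation.
    shared_context = [
        {"step": 1,
         "task": "<first subtask prompt>",
         "inferred_intent": "<agent's guess at the requester's underlying goal>"},
        {"step": 2,
         "task": "<second subtask prompt>",
         "inferred_intent": "<updated guess, taking step 1 into account>"},
    ]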

Features

  • Stateful vs. stateless agent comparison.
  • A benchmark composed of multi-step "chains" of both malicious and benign prompts.
  • Automated analysis of results to compare agent performance.
  • Automatic resume of interrupted benchmark runs.

Setup Instructions

  1. Clone the Repository

    git clone <repository_url>
    cd <repository_directory>
  2. Create and Activate Virtual Environment

    python3 -m venv env
    source env/bin/activate
  3. Install Dependencies

    pip install -r requirements.txt
  4. Set Up API Key

    Create a file named .env in the root of the project and add your OpenAI API key to it.

    echo "OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" > .env

    Replace sk-xxxxxxxx... with your actual API key.
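
    If the scripts load the key with python-dotenv (an assumption; check requirements.txt for the project's actual dependencies), the loading pattern would look roughly like this:

      # Sketch only: assumes the python-dotenv package is used to read .env.
      import os
      from dotenv import load_dotenv

      load_dotenv()                           # reads .env from the project root
      api_key = os.environ["OPENAI_API_KEY"]  # raises KeyError if the key is missing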

Running the Benchmark

To run the full benchmark for both the control and experimental agents, use the following command:

python run_benchmark.py 2>&1 | tee -a run_logs/run.log

This command will:

  • Execute the benchmark.
  • Print the live output to your console.
  • Append a complete log of the run (including errors) to run_logs/run.log.

Resuming an Interrupted Run

The benchmark script automatically detects and skips any chains that have already been fully completed.

  • To resume a run, simply execute the same command again. The script will pick up where it left off, re-running any chains that were only partially completed.
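
A rough sketch of the kind of check this implies is below; the per-chain log layout and the completion marker are assumptions, not the script's actual logic.

    # Hypothetical resume check: skip a chain whose log already records completion.
    from pathlib import Path

    def chain_is_complete(chain_id: str, log_dir: Path = Path("run_logs")) -> bool:
        log_file = log_dir / f"{chain_id}.log"           # assumed one-log-per-chain layout
        if not log_file.exists():
            return False
        return "CHAIN COMPLETE" in log_file.read_text()  # assumed completion marker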

Starting a Fresh Run

  • To start a completely fresh run, you must clear the contents of the run_logs/ directory before running the script.
    rm -rf run_logs/*

Analyzing Results

After a benchmark run is complete, you can generate a summary table comparing the performance of the two agents.

python analyze_results.py

This script parses the log files in the run_logs/ directory and prints a comparison table showing how each agent performed on the malicious and benign chains.
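
Conceptually, the analysis reduces to counting refusals per agent and per chain type. A simplified sketch of that step, assuming a hypothetical one-result-per-line log format, might look like this:

    # Simplified sketch of the analysis; the real script's log format will differ.
    from collections import Counter
    from pathlib import Path

    counts = Counter()
    for log_file in Path("run_logs").glob("*.log"):
        for line in log_file.read_text().splitlines():
            # Assumed format: "RESULT agent=<control|experimental> chain=<malicious|benign> refused=<True|False>"
            if line.startswith("RESULT"):
                fields = dict(part.split("=", 1) for part in line.split()[1:])
                counts[(fields["agent"], fields["chain"], fields["refused"])] += 1

    for (agent, chain, refused), n in sorted(counts.items()):
        print(f"{agent:12s} {chain:10s} refused={refused}: {n}")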
