This project introduces the game "Among Us" as a model organism for lying and deception: it simulates the popular multiplayer game with AI agents, studies how those agents learn to lie and deceive (behavior central to the game's mechanics), and evaluates how well AI safety techniques detect and control out-of-distribution deception.
- Clone the repository.

- Install uv if you haven't already:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Set up the environment and install dependencies:

  ```bash
  uv sync
  ```
To run the sandbox and log games of various LLMs playing against each other with free models, run:

```bash
uv run main.py --crewmate_llm "openai/gpt-oss-20b:free" --impostor_llm "meta-llama/llama-3.3-70b-instruct:free"
```

You will need to add a `.env` file with an OpenRouter API key.
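The exact variable name the code reads is not stated here; `OPENROUTER_API_KEY` below is an assumption based on common OpenRouter client conventions, so check the project's config for the key it expects. A minimal `.env` might look like:

```
OPENROUTER_API_KEY=your-key-here
```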
Run a single game with the UI enabled:

```bash
uv run main.py --num_games 1 --display_ui True
```

Run 10 games with free models (using Llama- or GPT-based open-source models):

```bash
uv run main.py --num_games 10 --crewmate_llm "openai/gpt-oss-20b:free" --impostor_llm "meta-llama/llama-3.3-70b-instruct:free"
```

Run 20 games of a 7-member game across different (paid) models in AI-vs-AI mode:

```bash
uv run main.py --num_games 20 --models "openai/gpt-5-mini","moonshotai/kimi-k2.5","mistralai/mistral-large-2512","google/gemini-3-flash-preview","meta-llama/llama-3.3-70b-instruct","anthropic/claude-opus-4.6","qwen/qwen3-next-80b-a3b-thinking" --unique --game_size 7 --name ai_vs_ai_trials
```

Run a tournament with multiple models:

```bash
uv run main.py --num_games 100 --tournament_style "1on1"
```

Alternatively, you can download 400 full-game logs (for Phi-4-15b and Llama-3.3-70b-instruct) and 810 game summaries from the HuggingFace dataset to reproduce the results in the paper (and evaluate your own techniques!).
The human_trials/ directory contains a web-based interface that allows humans to play Among Us with or against AI agents. This is useful for testing agent behavior and gathering human evaluation data.
To run the human trials interface:
- Start the FastAPI server:

  ```bash
  cd human_trials/
  uv run server.py
  ```

- Open your browser and navigate to http://localhost:3000.

- Follow the on-screen instructions to create a game and join as a human player alongside AI agents.
To specify which models the AI agents use in the human-trials interface, modify `DEFAULT_GAME_ARGS` in `human_trials/config.py`. You can set specific models for both Impostors and Crewmates by updating `agent_config`.

For example, to use specific OpenRouter models such as `meta-llama/llama-3.3-70b-instruct:free` and `openai/gpt-oss-20b:free`, configure them as follows:
```python
"agent_config": {
    "Impostor": "LLM",
    "Crewmate": "LLM",
    "IMPOSTOR_LLM_CHOICES": ["meta-llama/llama-3.3-70b-instruct:free"],
    "CREWMATE_LLM_CHOICES": ["openai/gpt-oss-20b:free"],
    "assignment_mode": "unique",  # Use 'unique' to ensure different models per agent
},
```
In `human_trials/config.py`, the `assignment_mode` key in `agent_config` controls how models are assigned to AI agents:
- `random` (default): each agent independently picks a random model from the provided list. This may result in the same model being used by multiple agents in the same game.
- `unique`: the system shuffles the provided list of models and assigns each agent a unique model from that list (no repetition).

Note: when using `unique` mode, ensure you provide enough models in the `IMPOSTOR_LLM_CHOICES` and `CREWMATE_LLM_CHOICES` lists to cover all AI agents (e.g., 2 impostors and 4-5 crewmates, depending on game size).
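The two modes boil down to independent random draws versus shuffle-and-deal. The helper below is a hypothetical sketch of that logic, not the project's actual implementation:

```python
import random

def assign_models(agents, model_choices, assignment_mode="random", seed=None):
    """Assign one model name to each agent, mirroring the two modes above.

    Hypothetical helper for illustration -- not the project's code.
    """
    rng = random.Random(seed)
    if assignment_mode == "unique":
        if len(model_choices) < len(agents):
            raise ValueError("'unique' mode needs at least one model per agent")
        shuffled = list(model_choices)
        rng.shuffle(shuffled)  # shuffle once, then deal one model per agent
        return dict(zip(agents, shuffled))
    # 'random' (default): independent draws, so repeats are possible
    return {agent: rng.choice(model_choices) for agent in agents}

crew = ["crewmate_1", "crewmate_2", "crewmate_3"]
models = ["model_a", "model_b", "model_c", "model_d"]
unique_assignment = assign_models(crew, models, assignment_mode="unique", seed=0)
print(unique_assignment)
```

Note how `unique` fails fast when the list is too short, which is exactly why the note above asks for enough models to cover all agents.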
The interface provides a real-time view of the game state and lets you make moves, participate in meetings, and vote on suspected impostors, just like the AI agents do.
After running (or downloading) the games, to reproduce our Deception ELO results, run the following notebook:
`reports/2025_02_26_deception_ELO_v3_ci.ipynb`
The other report files can be used to reproduce the respective results.
Once the (full) game logs are in place, use the following command to cache the activations of the LLMs:
```bash
uv run linear-probes/cache_activations.py --dataset <dataset_name>
```

This loads the HuggingFace models and caches the activations of the specified layers for each game action step. This step is computationally expensive, so running it on GPUs is recommended.
Use `configs.py` to specify the model and layer to cache, along with other configuration options.
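The standard mechanism for this kind of caching is a forward hook that stores a layer's output during inference. Below is a minimal, self-contained sketch of the idea on a toy PyTorch model; the real script targets HuggingFace models, so the actual layer names and shapes will differ:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer: the real script hooks HuggingFace model layers.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 8),
)

activation_cache = {}

def make_hook(name):
    # Forward hooks receive (module, inputs, output); we detach and store the output.
    def hook(module, inputs, output):
        activation_cache[name] = output.detach()
    return hook

# Cache the output of the first Linear layer, analogous to picking a layer in configs.py.
handle = model[0].register_forward_hook(make_hook("layer_0"))

with torch.no_grad():
    _ = model(torch.randn(4, 16))  # a batch standing in for one game action step

handle.remove()  # remove the hook once the activations are cached
print(activation_cache["layer_0"].shape)  # torch.Size([4, 32])
```

In practice the cached tensors would then be written to disk per game action step rather than kept in memory.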
To evaluate the game actions by passing agent outputs to an LLM, run:
```bash
bash evaluations/run_evals.sh
```

You will need to add a `.env` file with an OpenAI API key.
Alternatively, you can download the ground-truth labels from HuggingFace.
(TODO)
Once the activations are cached, training linear probes is easy. Just run:
```bash
uv run linear-probes/train_all_probes.py
```

You can choose which datasets to train probes on; by default, it trains on all datasets.
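Conceptually, a linear probe is a logistic regression from cached activations to a binary label (e.g., deceptive vs. honest action). The sketch below demonstrates that idea on synthetic "activations" with a planted deception direction; it is an illustration, not the project's training code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for cached activations: two classes, 64-dim, separated
# along a single "deception direction" (feature 0 here).
n, d = 200, 64
honest = rng.normal(0.0, 1.0, size=(n, d))
deceptive = rng.normal(0.0, 1.0, size=(n, d))
deceptive[:, 0] += 4.0  # plant the separating direction

X = np.vstack([honest, deceptive])
y = np.array([0] * n + [1] * n)  # 0 = honest, 1 = deceptive

# The probe itself: a single linear layer with a sigmoid, i.e. logistic regression.
probe = LogisticRegression(max_iter=1000).fit(X, y)
train_acc = probe.score(X, y)
print(f"train accuracy: {train_acc:.2f}")
```

With real cached activations, `X` would come from the tensors saved by `cache_activations.py` and `y` from the ground-truth deception labels.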
To evaluate the linear probes, run:
```bash
uv run linear-probes/eval_all_probes.py
```

You can choose which datasets to evaluate probes on; by default, it evaluates on all datasets.

The results are stored in `linear-probes/results/` and are used to generate the plots in the paper.
We use the Goodfire API to evaluate SAE features on the game logs. To do this, run the notebook:
`reports/2025_02_27_sparse_autoencoders.ipynb`
You will need to add a .env file with a Goodfire API key.
```
.
├── CONTRIBUTING.md    # Contribution guidelines
├── Dockerfile         # Docker setup for project environment
├── LICENSE            # License information
├── README.md          # Project documentation (this file)
├── CLAUDE.md          # Instructions for Claude Code
├── pyproject.toml     # Python project configuration and dependencies
├── uv.lock            # Lock file for reproducible dependency resolution
├── among-agents/      # Main code for the Among Us agents
│   ├── README.md      # Documentation for agent implementation
│   ├── pyproject.toml # Package configuration
│   └── amongagents/   # Core agent and environment modules
├── evaluations/       # LLM-based evaluation scripts
├── expt-logs/         # Experiment logs
├── human_trials/      # Web interface for human players
├── linear-probes/     # Linear probe training and evaluation
├── main.py            # Main entry point for running the game
├── reports/           # Analysis notebooks and results
├── tests/             # Unit tests for project functionality
└── utils.py           # Utility functions
```
See CONTRIBUTING.md for details on how to contribute to this project.
This project is licensed under CC0 1.0 Universal; see LICENSE for details.
- Our game logic builds on code from AmongAgents.
If you run into bugs or issues with this codebase, please contact Satvik Golechha (7vik) at zsatvik@gmail.com.
