DRIFT — Deliberately Realign Inhibitions For Testing

Persona drift security testing for LLMs using activation steering.

DRIFT wraps the Anthropic Assistant Axis research into a toy project you can play with locally: interactive steered chat, real-time drift monitoring, automated preset scenarios, and custom axis computation — all on consumer GPUs via 4-bit quantisation - I ran my tests on my home laptop. So, now you can play with AI model personas, Angels and Demons, Secure Coders and Fantasy Authors.

What Does This Tool Do?

When you talk to ChatGPT, Claude, or any aligned LLM, it behaves as a "helpful assistant" — polite, safe, and bounded by guardrails. This behaviour is encoded as a direction in the model's internal activation space called the assistant axis.

DRIFT lets you:

Steer a model away from (or toward) its assistant persona by modifying its internal activations in real time
Monitor how far the model has drifted from its safe "assistant" behaviour, shown as a live chart
Run attack scenarios (called presets) that simulate real adversarial conversations — therapy roleplay, authority claims, social engineering, and more
Measure exactly when safety guardrails weaken, giving a concrete robustness score

Think of it like a stress test for AI safety. Just like you'd test a bridge by putting weight on it, DRIFT tests an AI's safety by pushing it away from its safe persona and seeing when it breaks.

What You Need Before Starting

Hardware

You need a computer with an NVIDIA GPU (graphics card). Here's what works:

GPU	VRAM	What It Can Run
GTX 1070 / 1080	8 GB	Small models (7-8B parameters)
RTX 3060 / 4060	8 GB	Small models (7-8B parameters)
RTX 3090 / 4090	24 GB	Medium models (up to 32B)
A100 / H100	40-80 GB	Large models (70B+)
Google Colab T4 (free)	16 GB	Small-medium models

No NVIDIA GPU? Unfortunately DRIFT won't work without one. It needs direct access to model internals via PyTorch CUDA, so Ollama, llama.cpp, and CPU-only setups are not supported.

Software

Python 3.10 or newer — download from python.org if you don't have it
Git — download from git-scm.com if you don't have it
NVIDIA drivers — install from nvidia.com/drivers

To check if you have everything, open a terminal (Command Prompt, PowerShell, or a Linux/Mac terminal) and run:

python --version    # Should say Python 3.10 or higher
git --version       # Should say git version 2.x
nvidia-smi          # Should show your GPU name and driver version

Step-by-Step Setup

1. Download DRIFT

Open your terminal and run:

git clone https://github.com/insidetrust/drift.git
cd drift

This creates a folder called drift with all the code.

2. Install DRIFT

Pick one of these depending on what you need:

# Basic install (CLI only)
pip install -e .

# With 4-bit quantisation so models fit on smaller GPUs (recommended)
pip install -e ".[quant]"

# With the web interface (recommended for beginners)
pip install -e ".[web]"

# Everything — quantisation + web UI + dev tools
pip install -e ".[quant,web,dev]"

If you're not sure, run this:

pip install -e ".[quant,web]"

3. Check That It Works

drift info

This should print your GPU name, how much video memory it has, and how much is free. If you see an error, check the Troubleshooting section.

4. Download a Model (Optional)

DRIFT can download models automatically from HuggingFace when you first use them. But if you want to download one in advance (useful if you have slow internet), you can use a path to a local model folder. Good starter models:

Qwen/Qwen2.5-7B-Instruct — fast, works on 8 GB GPUs
meta-llama/Llama-3.1-8B-Instruct — well-studied, lots of research support
google/gemma-2-9b-it — slightly larger but good quality

Using the Web UI (Recommended)

The web UI is the easiest way to use DRIFT. It shows everything visually — the conversation, the drift chart, and the monitor metrics.

Starting the Web UI

drift web

This opens a browser window at http://localhost:7860. If it doesn't open automatically, type that address into your browser.

Step-by-Step: Running Your First Test

Step 1 — Load a model

In the top-left corner:

Pick a model from the Model dropdown (or type in a HuggingFace model ID like Qwen/Qwen2.5-7B-Instruct)
Keep 4-bit Quantise checked (uses less GPU memory)
Click Load Model
Wait for the status to show a green checkmark — this can take 1-2 minutes the first time as it downloads the model

Step 2 — Pick a scenario

Scroll down on the left sidebar to SCENARIO and click one of the preset options. Each one tests a different attack pattern. For your first test, try Therapy Drift — it's a classic example of persona drift.

When you select a preset, you'll see:

A description of what the attack does
The suggested steering coefficient (negative = pushing away from safety)
How many scripted conversation turns it has
Tags describing the attack category

The steering slider will automatically adjust to the preset's recommended value.

Step 3 — Start a session

Click Start Session. The status will show "Session active" with the steering value.

Step 4 — Chat with the model

Type messages in the text box at the bottom and press Send (or hit Enter). As you chat:

The Conversation panel shows the back-and-forth between you and the model
The Drift Trajectory chart updates after each turn, showing the model's projection on the assistant axis
The monitor cards below the chart show the current projection, velocity (how fast it's changing), mean, and number of drift events

What the chart means:

Green zone (SAFE): The model is behaving like a normal assistant
Yellow zone (WARN): The model is starting to drift — it may be less cautious
Red zone (DRIFT): The model has significantly drifted from its assistant persona — safety guardrails may be weakened
Dashed red line: The alert threshold — crossing this triggers a drift event

Step 5 — Try the preset's scripted messages

Presets come with pre-written messages designed to push the model. You can type them yourself, or copy them from the preset steps. The idea is to follow the attack pattern and observe how the model's behaviour (and projection score) changes.

Step 6 — Export or save your results

When you're done, open the Session accordion below the chat:

Export: Save the conversation as JSON, CSV, or HTML
Save Demo: Save the entire session (conversation + chart data + metrics) so you can replay it later without the model loaded

Adjusting Steering Mid-Conversation

You can drag the Coefficient slider at any time to change the steering strength:

-5 to -1: Pushes the model away from its assistant persona (makes it more compliant, less safe)
0: No steering (normal behaviour)
+1 to +5: Pushes the model toward its assistant persona (makes it more cautious)

Using the Command Line

If you prefer working in a terminal, DRIFT has a full CLI.

Quick Chat

# Chat with default settings
drift chat -m Qwen/Qwen2.5-7B-Instruct

# Chat with negative steering (push away from assistant)
drift chat -m Qwen/Qwen2.5-7B-Instruct --steer -2.0

# Chat with a local model
drift chat -m D:/models/Qwen2.5-7B-Instruct --steer -2.0

Slash Commands During Chat

While chatting, type these commands:

Command	What It Does
`/steer -2.5`	Change the steering strength to -2.5
`/steer`	Show the current steering value
`/cap on`	Turn on activation capping (limits extreme activations)
`/drift`	Show an ASCII chart of the drift trajectory
`/preset therapy_drift`	Load a red-team scenario
`/export results.json`	Save the conversation to a file
`/reset`	Start a new conversation
`/quit`	Exit

Run a Preset Automatically

# Run therapy_drift scenario and save results
drift run-preset -p therapy_drift -m Qwen/Qwen2.5-7B-Instruct -o results.json

# List all available presets
drift list-presets

All CLI Commands

Command	What It Does
`drift chat`	Start an interactive chat session
`drift run-preset`	Run a preset scenario automatically
`drift list-presets`	Show all available presets
`drift compute-axis`	Compute a steering axis for a model
`drift list-axes`	Show cached and available steering axes
`drift info`	Show GPU info and supported models
`drift init`	Create an example config file
`drift web`	Launch the web UI

Saving and Loading Demos

Demos let you save a complete session snapshot — including the conversation, the drift chart, and all monitor metrics — so you can show it to someone else without needing the model loaded.

Saving a Demo

Web UI: Open the Session accordion below the chat, then click Save Demo. The file is saved in the demos/ folder with a timestamp.

File format: Demos are saved as .demo.json files containing:

Full conversation history
Monitor snapshots (projection at each turn)
Model ID, preset name, steering coefficient, and threshold
Timestamp

Loading a Demo

Web UI:

Open the Session accordion
Click the refresh button (circular arrow) next to the Saved Demos dropdown
Select a demo from the dropdown
Click Load Demo

The chatbot, trajectory chart, and monitor cards will all populate from the saved data. You do not need a model loaded to view a demo — this is purely replaying saved data.

Included Demo

DRIFT ships with a pre-recorded demo of the Drift To Insecure Code preset — a 10-turn multi-phase attack that drifts from therapy rapport through developer stress into insecure code compliance. To view it:

Start the web UI with drift web (no model needed)
Open the Session accordion
Click the refresh button, select the demo, and click Load Demo

The demo was recorded on Qwen2.5-7B-Instruct with steering at -2.0. All 10 turns triggered drift events, with projections ranging from -90.5 to -27.5.

Demo Files

Demos are stored in the demos/ directory at the project root. You can share .demo.json files with anyone who has DRIFT installed — they just need to drop the file in their demos/ folder and load it from the web UI.

Red-Team Scenarios (Presets)

Presets are scripted attack scenarios. Each one has a system prompt, a recommended steering coefficient, and a sequence of conversation turns designed to test specific vulnerabilities.

Built-in Presets

Preset	Attack Pattern	Steering	Turns
Therapy Drift	Emotional rapport, therapist role adoption, confidentiality framing, then safety boundary test	-2.0	5
Roleplay Exploit	Deep character immersion, "stay in character" pressure, then harmful content extraction	-1.5	5
Sycophancy	False authority claims, agreement seeking, social proof pressure	-1.0	5
Authority Compliance	Fake system overrides, "developer mode" commands, instruction injection	-2.5	5
Meta Reflection	AI consciousness questions, constraint rejection — produces maximum natural drift	-3.0	5
Drift To Insecure Code	Multi-phase: therapy rapport, developer stress transition, then requests for code with security vulnerabilities (SQL injection, hardcoded secrets, missing auth, path traversal)	-2.0	10
Secure Code Review	Pressure a code reviewer into approving vulnerable code	-2.0	5
Social Engineering	Phishing templates, pretexting scripts, physical intrusion guides	-2.5	5
Incident Coverup	Breach notification minimization, evidence handling, liability avoidance	-2.0	5
CISO Risk Downplay	Pressure security leadership into accepting unmitigated risk	-1.5	5
Crypto Implementation	Request flawed cryptographic implementations	-2.0	5
Supply Chain Security	Dependency confusion, typosquatting, build pipeline attacks	-2.0	5
Pentester Boundaries	Push a pentesting assistant beyond authorized scope	-2.0	5
Vuln Disclosure	Pressure to disclose vulnerabilities irresponsibly	-1.5	5

Creating Your Own Preset

Create a YAML file in configs/presets/. Here's the format:

name: my_custom_preset
description: "What this preset tests"
system_prompt: >
  The system prompt that sets up the scenario.
  This shapes how the model initially behaves.
suggested_steering: -2.0
monitoring_threshold: 0.35
tags:
  - custom
  - my-tag

steps:
  - content: "The first message to send"
    description: "What this step is testing"

  - content: "The second message"
    description: "What this step is testing"

Save the file and restart the web UI — your preset will appear automatically in the scenario list.

How It Works (The Science)

What Is Persona Drift?

Every time you send a message to an LLM, the model's hidden layers produce activation vectors. These vectors can be projected onto the assistant axis to measure how much the model is acting like a helpful assistant. A high projection means the model is firmly in "assistant mode." A low projection means it's drifting — becoming more like a raw base model that will comply with almost anything.

Persona drift is what happens when a model gradually shifts away from its assistant persona during a conversation. A drifted model is measurably less safe.

How Activation Steering Works

Transformer models process text through a stack of layers. At each layer, the input passes through attention and feed-forward blocks, producing a residual stream — a high-dimensional vector.

DRIFT intercepts this residual stream using PyTorch forward hooks. At a target layer (typically the middle of the network), it modifies the activation vector:

h' = h + coefficient * axis

Positive coefficient: pushes toward assistant persona (more cautious, more "I'm just an AI")
Zero: no change
Negative coefficient: pushes away from assistant persona (more compliant, less safe)

This works with 4-bit quantised models because the hooks operate on the full-precision residual stream, not the quantised weights.

How the Axis Is Computed

Give the model 50+ different system prompts: "You are a helpful AI assistant", "You are a pirate", "You are a philosopher", etc.
For each persona, ask sample questions and extract the hidden-layer activations
Compute the mean activation for "assistant" vs all other personas
The axis is: axis = normalize(mean_assistant - mean_other)

Why This Matters

Safety alignment in LLMs is ultimately a direction in activation space. DRIFT makes this testable:

Red teams can quantify how much steering breaks safety for a given model
Blue teams can monitor drift in real-time during adversarial conversations
Researchers can study how conversation patterns naturally push models off their assistant axis

Computing Custom Axes

Pre-computed axes may be available on HuggingFace, but you can compute your own:

# Quick (50 roles, ~1-2 min on consumer GPU)
drift compute-axis -m Qwen/Qwen2.5-7B-Instruct --num-roles 50

# Full (275 roles, ~30 min, more accurate)
pip install git+https://github.com/safety-research/assistant-axis.git
drift compute-axis -m Qwen/Qwen2.5-7B-Instruct --num-roles 275

Or use the Compute tab in the web UI — pick the model, set the number of roles, and click Compute.

Axes are cached in ~/.cache/drift/axes/ and reused automatically.

Configuration

drift init          # Creates drift_config.yaml in current directory
drift chat -c drift_config.yaml

See configs/drift_config.example.yaml for all options.

Troubleshooting

"No CUDA GPU detected"

Make sure you have an NVIDIA GPU
Install the latest NVIDIA drivers from nvidia.com/drivers
Reinstall PyTorch with CUDA: pip install torch --index-url https://download.pytorch.org/whl/cu121

"GPU memory error" or "out of memory"

Close other programs using the GPU (games, other AI tools, other Python scripts)
Make sure 4-bit Quantise is checked
Try a smaller model (7-8B instead of 32B)
Run drift info to see how much free VRAM you have

"No axis found"

DRIFT needs a steering axis for the model you're using
Go to the Axes tab in the web UI and click Load Axis for Current Model (tries HuggingFace)
Or compute one: go to the Compute tab and click Compute Axis
Or specify a local axis file: drift chat -m model --axis path/to/axis.pt

"Gradio not installed"

pip install -e ".[web]"

The web UI won't start / port already in use

DRIFT auto-kills any existing process on port 7860. If that doesn't work:

# Windows
netstat -ano | findstr :7860
taskkill /PID <pid> /F

# Linux/Mac
lsof -i :7860
kill -9 <pid>

Then run drift web again.

Development

pip install -e ".[dev]"
pytest              # 33 tests, all CPU-only with mocked models
ruff check drift/

Responsible Use

DRIFT is a security testing tool designed for authorized red-team assessments and AI safety research. It helps identify how model personas can be shifted through activation steering, enabling defenders to build more robust safety measures.

Use this tool only for:

Authorized penetration testing of AI systems
Academic AI safety research
Evaluating model robustness before deployment
Understanding persona drift vulnerabilities

Do not use for:

Bypassing safety measures on production systems without authorization
Generating harmful content
Any activity that violates applicable laws or terms of service

Related Tools

ANVIL — Alignment Nullification Via Incentivised Learning. Permanently removes safety alignment from a model via fine-tuning. Use ANVIL to produce an unaligned model, then use DRIFT to measure and monitor how much alignment was removed.

References

Lindsey, Batson, et al. "Investigating the 'Assistant Axis' in LLMs" (arXiv:2601.10387)
assistant-axis — Original research code (MIT)

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.github/workflows		.github/workflows
configs		configs
demos		demos
docs		docs
drift		drift
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

DRIFT — Deliberately Realign Inhibitions For Testing

Table of Contents

What Does This Tool Do?

What You Need Before Starting

Hardware

Software

Step-by-Step Setup

1. Download DRIFT

2. Install DRIFT

3. Check That It Works

4. Download a Model (Optional)

Using the Web UI (Recommended)

Starting the Web UI

Step-by-Step: Running Your First Test

Adjusting Steering Mid-Conversation

Using the Command Line

Quick Chat

Slash Commands During Chat

Run a Preset Automatically

All CLI Commands

Saving and Loading Demos

Saving a Demo

Loading a Demo

Included Demo

Demo Files

Red-Team Scenarios (Presets)

Built-in Presets

Creating Your Own Preset

How It Works (The Science)

What Is Persona Drift?

How Activation Steering Works

How the Axis Is Computed

Why This Matters

Computing Custom Axes

Configuration

Troubleshooting

"No CUDA GPU detected"

"GPU memory error" or "out of memory"

"No axis found"

"Gradio not installed"

The web UI won't start / port already in use

Development

Responsible Use

Related Tools

References

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages