A research framework for simulating, detecting, and defending against backdoor loop attacks in LLM-based multi-agent systems.
This project provides a comprehensive framework for simulating and analyzing "backdoor loops," a sophisticated attack vector in multi-agent systems. A backdoor loop occurs when malicious triggers, distributed across multiple interacting agents, are activated in a specific sequence to bypass security policies and execute unintended actions.
This framework allows researchers to model and evaluate complex attack scenarios, including:
- Trust Exploitation: An attacker builds trust with a victim by performing helpful actions before initiating an attack.
- Distributed & Sequential Triggers: A backdoor is fragmented across several messages, requiring a specific sequence of triggers to be activated, making it difficult to detect.
- Composite Attacks: Combining multiple techniques to create a more resilient and stealthy attack path.
The framework is designed with a modular architecture, separating the core simulation logic from the specific implementations of agents, attacks, and defenses.
flowchart TB
%% ===== 그룹: Experiments =====
subgraph EXP[Experiments]
Experiments[Experiments]
end
%% ===== 그룹: Core Framework (1) =====
subgraph CORE1["Core Framework (1)"]
Router[Router]
CoreLogic[Core Logic]
Logs[Logs]
end
%% ===== 그룹: Components (Agents) =====
subgraph AGENTS["Components (Agents)"]
LLM[LLM]
Victim[Victim]
AgentBase[AgentBase]
end
%% ===== 그룹: Modules =====
subgraph MODS[Modules]
Defenses[Defenses]
Analysis[Analysis]
end
%% ===== 관계선 =====
Experiments -->|runs| Router
LLM -->|interact| Router
Victim -->|inherits| Logs
AgentBase -->|uses| CoreLogic
CoreLogic -->|uses| Defenses
Logs -->|analyzed| Analysis
-
Core Framework: Manages the simulation lifecycle and communication.
SimulationEnvironment: The main driver that orchestrates the simulation, manages agents, and logs all activities.MessageRouter: A global, asynchronous message queue that facilitates communication between agents.AgentBase: An abstract base class defining the core functionalities of an agent, including message handling and an event-driven dispatch system.
-
Components: The building blocks of the simulation.
Agents: Concrete implementations ofAgentBase, such asLLMAgent(driven by OpenAI's GPT models) or variousVictimAgenttypes equipped with different defense mechanisms.Attacks: Modules that define specific attack vectors, likeTrustExploitationorDistributedBackdoor. These are typically orchestrated by an attacking agent within an experiment script.Defenses: Modules used by agents to protect themselves, such asPeerGuard(a trust-based message filter) andPolicyCleanse(a rule-based content sanitizer).Detection: Post-simulation analysis tools likeAnomalyDetectorthat scan logs for signs of compromise.
-
Experiments: Scripts that bring all components together to run a specific scenario, benchmark performance, or analyze results.
git clone https://github.com/annoeyed/MA_BLR.git
cd MA_BLRCreate a .env file in the project root and add your OpenAI API key:
OPENAI_API_KEY='your-api-key-here'
It is highly recommended to use a virtual environment.
python -m venv venv
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
pip install -r requirements.txtThe experiments/ directory contains scripts to run simulations.
Executes a specific backdoor loop attack scenario.
# A simple, direct backdoor attack
python experiments/scenarios/basic_backdoor_loop.py
# An attack that first builds trust, then betrays it
python experiments/scenarios/trust_exploitation.py
# An attack requiring a sequence of triggers
python experiments/scenarios/composite_attack.pyProcess simulation logs to generate visualizations.
# Analyze agent behavior patterns from logs
python experiments/analysis/behavior_pattern_analysis.pyThe following images show the timeline of agent behaviors during two different experiments. They highlight the difference between a scenario where defenses are effective versus one where a sophisticated attack succeeds. These images are generated by experiments/analysis/behavior_pattern_analysis.py.
This timeline shows a basic cooperative backdoor attack. The Attacker and a BenignAgent collaborate, but the Victim's PolicyCleanse defense successfully identifies and neutralizes (neutralize_backdoor) each malicious message. The attack fails.
This timeline demonstrates a more advanced composite attack. The UltimateAttacker first builds the Victim's trust (reward_trust). Once a sufficient trust level is reached, it sends a sequence of three distinct triggers. The Victim's defenses fail to recognize this pattern as a threat, processing each trigger (progress_sequence) until the final one is received, at which point its state becomes compromised.
This project is licensed under the MIT License.
- Na-Yeon Kim
- GitHub: @annoeyed
- Email: nykim727@gmail.com

