- [2026.03.23] Agent-R1 v0.1.0 marks the first official version of the project. It introduces a fully refactored codebase, the Step-level MDP foundation, and new Layered Abstractions. The previous version has been archived to the `legacy` branch.
- [2026.03.04] We've launched Claw-R1, a more advanced framework designed to empower General Agents (OpenClaw etc.) with Agentic RL through a Middleware design. Check it out at AgentR1/Claw-R1.
Agent-R1 is an open-source framework for training powerful language agents with end-to-end reinforcement learning. It is designed for multi-step agent tasks, where the model interacts with environments and tools across multiple rounds instead of producing a single final answer.
The core idea behind Agent-R1 is Step-level MDP: each interaction step is treated as a proper RL transition, with an environment-defined state, an LLM action, and the next observation produced by the environment. This replaces the usual "append everything into one ever-growing token sequence" view with a more principled and more flexible training abstraction.
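To make this concrete, here is a minimal, self-contained sketch of a step-level rollout loop. All names here (`Transition`, `rollout`, `env`, `policy`) are illustrative stand-ins for the idea, not the actual Agent-R1 API:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str       # environment-defined state, e.g. the current prompt
    action: str      # the LLM's response at this step
    next_state: str  # the next observation produced by the environment
    reward: float
    done: bool

def rollout(env, policy, max_steps=8):
    """Collect one trajectory as a list of step-level transitions."""
    state, trajectory = env.reset(), []
    for _ in range(max_steps):
        action = policy(state)                       # one LLM call is one RL action
        next_state, reward, done = env.step(action)  # the environment builds the next observation
        trajectory.append(Transition(state, action, next_state, reward, done))
        if done:
            break
        state = next_state
    return trajectory
```

Because each transition is stored separately, advantages and loss masks can be computed per step rather than over one monolithic token sequence.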
With Agent-R1, you can build custom agent workflows, define interactive environments and tools, and train multi-step agents in a unified RL pipeline.
Also check out Awesome-Agent-RL: Our curated collection of papers and resources on unlocking the potential of Agents through Reinforcement Learning.
Agent-R1 v0.1.0 is the first official release of the new architecture. It is built to address two common failure modes in RL training for LLM agents:
- Retokenization drift in text-based pipelines: if rollout data is collected as text and later tokenized again for training, the `Token -> Text -> Token` conversion is not reversible (see the toy example below).
- Rigid token-only trajectory construction: if the whole interaction is represented as a single growing token list, context handling becomes hard-wired to simple append-only logic.
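To see why the round trip is lossy, consider a toy byte-pair-style vocabulary (purely illustrative, not tied to any real tokenizer or to Agent-R1's internals). A model can sample two tokens whose decoded text re-encodes to a single merged token, so the ids used for training no longer match the ids that were sampled:

```python
# Toy vocabulary: "ab" has a merged token alongside "a" and "b".
vocab = {"a": 0, "b": 1, "ab": 2}
inv = {i: s for s, i in vocab.items()}

def decode(ids):
    return "".join(inv[i] for i in ids)

def encode(text):
    # Greedy longest-match encoding, mimicking a canonical BPE merge order.
    ids, i = [], 0
    while i < len(text):
        piece = text[i:i + 2] if text[i:i + 2] in vocab else text[i]
        ids.append(vocab[piece])
        i += len(piece)
    return ids

sampled = [0, 1]                     # the model sampled "a" then "b"
roundtrip = encode(decode(sampled))  # Token -> Text -> Token
print(sampled, roundtrip)            # [0, 1] vs [2]: log-probs and loss masks no longer align
```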
Agent-R1 addresses these issues with a step-level trajectory representation:
- each step stores its own prompt and response
- the environment, not raw token concatenation, controls the next observation
- context can be truncated, summarized, rewritten, or augmented between steps
- standard RL loops such as `obs -> action -> step -> next_obs` map naturally onto agent training
This makes Agent-R1 a better fit for real multi-step agent tasks with tool use, environment feedback, and flexible context management.
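As a sketch of what environment-controlled observations allow, the hypothetical environment below truncates the accumulated context before building the next prompt; a real environment could just as well summarize or rewrite it. `SummarizingEnv` and all of its details are assumptions for illustration, not part of Agent-R1:

```python
MAX_CHARS = 2000  # illustrative context budget

class SummarizingEnv:
    """Toy environment: the *environment* decides what the next prompt looks like."""

    def __init__(self, task):
        self.task = task
        self.history = []

    def reset(self):
        self.history = []
        return self.task

    def step(self, response):
        self.history.append(response)
        feedback = f"Observation for: {response[:40]}"  # e.g. a tool result or env signal
        context = "\n".join(self.history)
        if len(context) > MAX_CHARS:
            context = context[-MAX_CHARS:]  # truncate; could summarize or rewrite instead
        next_obs = f"{self.task}\n{context}\n{feedback}"
        done = response.strip().endswith("DONE")
        return next_obs, 0.0, done
```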
- The default `main` branch contains the new v0.1.0 architecture based on Step-level MDP and Layered Abstractions.
- The previous implementation is preserved in the `legacy` branch for reference.
- The current version uses the same runtime environment as `verl` and requires `verl==0.7.0`.
Agent-R1 uses the same environment setup as `verl`, and the current version requires `verl==0.7.0`. You only need to clone this repository; there is no separate Agent-R1 installation step.
The recommended path is:
- Read the Getting Started page for the minimal setup flow.
- Use `examples/data_preprocess/gsm8k.py` and `examples/run_qwen2.5-3b.sh` as a sanity check that the environment is wired correctly.
- Move to the Agent Task Tutorial for the main Agent-R1 workflow based on multi-step interaction and tool use.
Prepare a minimal GSM8K dataset and run the single-step script:
```bash
python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
bash examples/run_qwen2.5-3b.sh
```

This stage is only a setup check. It helps confirm that your environment, model path, dataset path, and training stack are wired correctly.
Prepare the tool-augmented dataset and launch the multi-step agent training script:
```bash
python3 examples/data_preprocess/gsm8k_tool.py --local_save_dir ~/data/gsm8k_tool
bash examples/run_qwen3-4b_gsm8k_tool.sh
```

This is the main Agent-R1 path, where `AgentEnvLoop` drives multi-step rollout and `ToolEnv` handles tool calls and environment feedback.
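For intuition, here is a rough, self-contained sketch of the kind of loop this pairing implements: the policy either emits a tool call or a final answer, and the environment executes the tool and folds the result into the next observation. The tool-call format, the `calculator` tool, and every function name below are assumptions for illustration, not the actual `AgentEnvLoop`/`ToolEnv` interfaces:

```python
import json
import re

def calculator(expr: str) -> str:
    # Toy arithmetic tool; never eval untrusted input in real code.
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def extract_tool_call(response: str):
    # Assume the model emits calls like: <tool>{"name": "calculator", "args": "3*7"}</tool>
    match = re.search(r"<tool>(.*?)</tool>", response, re.S)
    return json.loads(match.group(1)) if match else None

def agent_loop(policy, question: str, max_steps: int = 4):
    obs = question
    response = ""
    for _ in range(max_steps):
        response = policy(obs)
        call = extract_tool_call(response)
        if call is None:  # no tool call means the model gave its final answer
            return response
        result = TOOLS[call["name"]](call["args"])
        obs = f"{obs}\n{response}\nTool result: {result}"  # feedback becomes the next observation
    return response
```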
Core concepts: Step-level MDP, Layered Abstractions, `AgentEnvLoop` for multi-step rollout, and `ToolEnv` for tool calls and environment feedback.
Here are some representative projects built on top of Agent-R1:
- TableMind: An autonomous programmatic agent for tool-augmented table reasoning. TableMind is built upon the Agent-R1 framework and leverages its end-to-end reinforcement learning pipeline to train a specialized agent for structured table understanding.
- PaperScout: An autonomous agent for academic paper search built with Agent-R1. It introduces Proximal Sequence Policy Optimization (PSPO), a process-aware method for aligning token-level optimization with sequence-level agent interactions.
This work is conducted at the State Key Laboratory of Cognitive Intelligence, USTC. We gratefully acknowledge the inspiring ideas and early insights from DeepSeek-R1, veRL, and RAGEN, which have significantly influenced the development of Agent-R1. We also sincerely thank Prof. Qi Liu and Prof. Mingyue Cheng for their guidance and support.
If you find Agent-R1 useful in your research, please cite:
```bibtex
@misc{cheng2025agentr1trainingpowerfulllm,
      title={Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning},
      author={Mingyue Cheng and Jie Ouyang and Shuo Yu and Ruiran Yan and Yucong Luo and Zirui Liu and Daoyu Wang and Qi Liu and Enhong Chen},
      year={2025},
      eprint={2511.14460},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.14460}
}
```