PPR is a reinforcement learning framework that integrates principle-based process rewards and reward normalization to achieve stable and effective training of LLM agents on search tasks.
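As a rough illustration of the reward-normalization idea, the sketch below shows plain group-wise z-score normalization of rollout rewards. This is a toy example, not PPR's method: the actual hybrid scheme combines process-level and outcome-level signals and is specified in the paper cited at the bottom of this README.

# Illustrative sketch only, NOT PPR's exact algorithm: z-score normalization
# of rewards across a group of rollouts for the same query, in the style of
# GRPO-like baselines.
import torch

def normalize_group_rewards(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Center and scale so each rollout group has zero mean and unit variance.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rewards for 4 rollouts sampled for one query.
advantages = normalize_group_rewards(torch.tensor([0.2, 0.8, 0.5, 0.1]))

To set up the environment: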
conda create -n ppr python=3.10
conda activate ppr
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install vllm==0.6.3
# verl
pip install -e .
# flash attention 2
pip3 install flash-attn --no-build-isolation
pip install wandb
# Local retriever env
pip install pyserini
pip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
# sglang for reward model serving
# We recommend creating a new environment with torch>=2.6 to install sglang, as installing it into the current environment may cause package conflicts.
pip install "sglang[all]"

Train a 3B search LLM with PPRM on the NQ dataset, using e5 as the retriever and Wikipedia as the corpus.
(1) Download the indexing and corpus.
save_path=/the/path/to/save
python scripts/download_corpus.py --save_path $save_path
cat $save_path/part_* > $save_path/e5_Flat.index
gzip -d $save_path/wiki-18.jsonl.gz
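Optionally, sanity-check the merged index before moving on. A minimal sketch, assuming the file names from the download step above; note that loading the full flat index requires substantial RAM:

# Quick check that the merged FAISS index loads and reports its size.
import faiss

index = faiss.read_index("/the/path/to/save/e5_Flat.index")  # i.e. $save_path/e5_Flat.index
print(index.ntotal, "vectors of dimension", index.d)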
(2) Process the NQ dataset.
bash scripts/data_process.sh

(3) Download the process reward models (PPRMs).
# PPRM with 3B training data
huggingface-cli download --resume-download peiranxu/PPRM_3b_data --local-dir PPRM_3b_data
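To serve the downloaded PPRM with sglang, a launch along these lines should work (run it in the torch>=2.6 sglang environment from the setup above; the port is an arbitrary choice, and your hardware may need additional flags):

python -m sglang.launch_server --model-path PPRM_3b_data --port 30000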
(4) Launch a local retrieval server.
bash retrieval_launch.sh
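Once the server is up, a quick smoke test can confirm it answers queries. This sketch assumes a Search-R1-style /retrieve endpoint on port 8000; check retrieval_launch.sh for the actual host, port, and payload schema:

# Hypothetical smoke test; the endpoint and payload format are assumptions.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/retrieve",
    json={"queries": ["who wrote the declaration of independence"], "topk": 3},
)
print(resp.json())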
(5) Run RL training with PPRM and Qwen2.5-3B-Instruct.
conda activate ppr
bash examples/train_3b.sh

The implementation of this project is built upon veRL, Search-R1, and RAGEN. We deeply appreciate these teams for their contributions to open-source research and development.
@article{xu2025hybrid,
  title={Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks},
  author={Xu, Peiran and Li, Zhuohao and Xing, Xiaoying and Zhang, Guannan and Li, Debiao and Shi, Kunyu},
  journal={arXiv preprint arXiv:2509.25598},
  year={2025}
}

