
PPR: Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks

Paper | Code | Model | Project

📖 Overview

Overview of PPR

PPR is a reinforcement learning framework that integrates principle-based process rewards with reward normalization to achieve stable and effective training of LLM agents on search tasks.
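The exact reward formulation is given in the paper; the sketch below only illustrates the general idea, assuming per-step principle-based PPRM scores plus a scalar outcome score per rollout, each normalized across a group of rollouts before mixing. All function names, variables, and the mixing weight are illustrative, not the repository's API.

# Minimal sketch (not the repository's API): combine principle-based
# per-step process rewards with an outcome reward, then normalize
# each component across a group of rollouts before mixing them.
import numpy as np

def normalize(values: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Z-score normalization across a group of rollouts."""
    return (values - values.mean()) / (values.std() + eps)

def hybrid_rewards(process_scores, outcome_scores, alpha: float = 0.5):
    """
    process_scores: per-step PPRM scores, one list per rollout
    outcome_scores: one scalar outcome score per rollout
    alpha: illustrative mixing weight between the two reward sources
    """
    # Collapse each rollout's step-level scores to a trajectory-level score.
    proc = np.array([np.mean(steps) for steps in process_scores])
    outc = np.array(outcome_scores, dtype=float)

    # Normalize each reward source separately so neither dominates,
    # then mix them into a single scalar reward per rollout.
    return alpha * normalize(proc) + (1.0 - alpha) * normalize(outc)

# Example: 4 rollouts sampled for the same query.
rewards = hybrid_rewards(
    process_scores=[[0.8, 0.6], [0.2, 0.4, 0.3], [0.9], [0.5, 0.5]],
    outcome_scores=[1.0, 0.0, 1.0, 0.0],
)
print(rewards)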


Installation

Environment

conda create -n ppr python=3.10
conda activate ppr
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install vllm==0.6.3

# verl
pip install -e .

# flash attention 2
pip3 install flash-attn --no-build-isolation
pip install wandb

# Local retriever env
pip install pyserini
pip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

# sglang for reward model serving
# We recommend creating a new environment with torch>=2.6 to install sglang, as installing it in the current environment may cause package conflicts.
pip install sglang[all]
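sglang's launch_server exposes an OpenAI-compatible HTTP API (default port 30000), so once the PPRM is served (for example with `python -m sglang.launch_server --model-path PPRM_3b_data`), step-level judgments can be requested over HTTP. The sketch below is only an illustration: the prompt format, model name, and scoring convention are assumptions, not the format this repository uses.

# Minimal sketch (prompt format, model name, and port are assumptions):
# ask a PPRM served by sglang's OpenAI-compatible API to judge one agent step.
import requests

PPRM_URL = "http://localhost:30000/v1/chat/completions"  # sglang's default port is 30000

def score_step(principle: str, step: str) -> str:
    """Send one (principle, agent step) pair to the reward model and return its judgment."""
    payload = {
        "model": "PPRM_3b_data",  # model name as exposed by the sglang server (assumed here)
        "messages": [
            {"role": "system", "content": f"Judge the agent step against this principle:\n{principle}"},
            {"role": "user", "content": step},
        ],
        "temperature": 0.0,
    }
    response = requests.post(PPRM_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(score_step(
        principle="Search queries should be specific and grounded in the question.",
        step="<search> capital of France </search>",
    ))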

Quick start

Train a 3B search LLM with PPRM on the NQ dataset, using E5 as the retriever and Wikipedia as the corpus.

(1) Download the index and corpus.

save_path=/the/path/to/save
python scripts/download_corpus.py --save_path $save_path
cat $save_path/part_* > $save_path/e5_Flat.index
gzip -d $save_path/wiki-18.jsonl.gz

(2) Process the NQ dataset.

bash scripts/data_process.sh

(3) Download the process reward models (PPRMs).

# PPRM with 3B training data
huggingface-cli download --resume-download peiranxu/PPRM_3b_data --local-dir PPRM_3b_data

(4) Launch a local retrieval server.

bash retrieval_launch.sh
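To verify the retrieval server is up, you can send a test query. The endpoint path, port, and payload below are assumptions modeled on Search-R1-style retrieval servers; check retrieval_launch.sh for the actual host and port.

# Minimal sanity check for the retrieval server (endpoint, port, and payload
# are assumptions; see retrieval_launch.sh for the values used in this repo).
import requests

resp = requests.post(
    "http://localhost:8000/retrieve",
    json={"queries": ["who wrote the declaration of independence"], "topk": 3},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())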

(5) Run RL training with PPRM on Qwen2.5-3B-Instruct.

conda activate ppr
bash examples/train_3b.sh

Performance

Main Results

Case Study

Acknowledgements

The implementation of this project is built upon veRL, Search-R1, and RAGEN. We deeply appreciate these teams' contributions to open-source research and development.

Citations

@article{xu2025hybrid,
  title={Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks},
  author={Xu, Peiran and Li, Zhuohao and Xing, Xiaoying and Zhang, Guannan and Li, Debiao and Shi, Kunyu},
  journal={arXiv preprint arXiv:2509.25598},
  year={2025}
}
