Reinforcement Learning with Verifiable Rewards (RLVR) assumes abundant clean ground-truth labels; in practice, expert scarcity and weak verifiers make noisy labels unavoidable. This repository accompanies the first systematic analysis of noisy-label mechanisms in RLVR and introduces OLR, which refines suspicious labels online during training using three signals from the policy’s own rollouts: majority answers, pass-rate slope, and historical consistency. The model thus keeps improving its reasoning robustly under noisy supervision.
| Idea | What it means |
|---|---|
| Rollout-dependent effect | Unlike standard classification, in RLVR whether a label hurts training depends on whether the current policy can produce rollouts that realize that label; wrong labels obey the same dynamic. |
| Inactive vs. active noise | Inactive: the policy rarely samples the wrong label, so it mostly wastes rollouts and hurts data efficiency. Active: the policy can sample the wrong label, so it gets positively reinforced and skews the policy; this case is often more harmful. |
| Early Correctness Coherence | Empirically, early in training, accuracy on clean and noisy samples rises together; they diverge later. That leaves a window where a reliable majority signal already exists—where OLR steps in. |
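To make the majority signal concrete, here is a minimal Python sketch of computing a prompt’s majority answer and that answer’s pass rate from one batch of rollouts. It is illustrative only; the function name and interface are hypothetical, not the repository’s implementation:

```python
from collections import Counter

def majority_signal(rollout_answers: list[str]) -> tuple[str, float]:
    """Return the majority-voted answer among one prompt's rollouts and
    the fraction of rollouts that agree with it (its pass rate)."""
    counts = Counter(rollout_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(rollout_answers)

# e.g., final answers extracted from 8 rollouts of the same prompt
ans, rate = majority_signal(["42", "42", "7", "42", "42", "13", "42", "42"])
print(ans, rate)  # -> 42 0.75
```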
OLR replaces the original (possibly noisy) label with the majority-voted answer when both hold:
- Positive slope of the majority answer’s pass rate — rollouts for the same prompt increasingly agree, so the signal stabilizes as a target.
- Historical consistency — the majority answer stays dominant across updates, filtering spurious majorities.
Supervision then self-refines as the policy improves. In code, this corresponds to use_olr and the pseudo-label / filtering logic in ray_trainer.
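As a rough illustration of that gating, the sketch below checks both conditions over a per-prompt history of majority answers and their pass rates. All names here (olr_should_relabel, consistency_window, the default threshold) are hypothetical; the actual logic lives in verl/trainer/ppo/ray_trainer.py and is tuned via flags such as trainer.slope_tres.

```python
import numpy as np

def olr_should_relabel(pass_rates: list[float],
                       majorities: list[str],
                       slope_thres: float = 0.0,
                       consistency_window: int = 3) -> bool:
    """Gate label refinement on (1) a positive pass-rate slope for the
    majority answer and (2) the same majority persisting across updates."""
    if len(pass_rates) < consistency_window:
        return False
    # (1) least-squares slope of the majority answer's pass-rate trajectory
    slope = np.polyfit(np.arange(len(pass_rates)), pass_rates, deg=1)[0]
    # (2) the majority answer has not flipped within the recent window
    consistent = len(set(majorities[-consistency_window:])) == 1
    return slope > slope_thres and consistent

# If the gate fires (and epoch > start_select_epoch with use_olr=True),
# the prompt's noisy label is replaced by the current majority answer.
print(olr_should_relabel([0.4, 0.55, 0.7], ["42", "42", "42"]))  # -> True
```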
Across noise ratios 0.1–0.9, OLR yields consistent gains under both inactive and active noise:
| Setting | In-distribution (avg. over 6 math benchmarks) | Out-of-distribution (ARC-c, GPQA-diamond, MMLU-Pro) |
|---|---|---|
| Inactive noise | +3.6% | +3.3% |
| Active noise | +3.9% | +4.6% |
The six in-distribution benchmarks are AIME24, AIME25, AMC, MATH-500, Minerva, and Olympiad.
| Piece | Description |
|---|---|
| Entry point | python -m verl.trainer.main_olr (Hydra; default verl/trainer/config/ppo_trainer.yaml) |
| Core logic | verl/trainer/ppo/ray_trainer.py: pseudo-labels and pass rates under weak / strong noise branches; OLR gating when epoch > start_select_epoch and use_olr=True |
| Scripts | exp_script/: Model + noisy-label data examples |
| Evaluation | eval_scripts/: eval_best.sh, etc. (edit paths for your machine) |
For verl architecture, dependencies, and tuning, see the documentation.
Full install instructions: verl Installation.
This repository contains all the data (under data/) needed for training and testing. The models used for training include Qwen3-4B-Base, Qwen3-8B-Base, and DeepSeek-R1-Distill-Llama-8B.
```bash
# OLR (use_olr=True)
bash exp_script/run_qwen3-4B-base_noise_label_weak_use_olr.sh

# Baselines (use_olr=False; switch baseline as needed)
bash exp_script/run_qwen3-4B-base_noise_label_unsupervised_baselines.sh
```

Common Hydra overrides: algorithm.adv_estimator=grpo, trainer.train_mode (weak / strong), +trainer.use_olr, +trainer.start_select_epoch, +trainer.slope_tres, +trainer.baseline. Tune GPU count, batch size, and tensor parallelism for your hardware.
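To inspect the composed configuration without launching a run, Hydra’s compose API accepts the same overrides. A sketch, assuming it is executed from the repository root and that the default config composes cleanly outside the launcher (the override values are illustrative):

```python
from hydra import compose, initialize
from omegaconf import OmegaConf

# Compose the default trainer config with OLR-related overrides,
# mirroring what the launch scripts pass on the command line.
with initialize(version_base=None, config_path="verl/trainer/config"):
    cfg = compose(
        config_name="ppo_trainer",
        overrides=[
            "algorithm.adv_estimator=grpo",
            "+trainer.use_olr=True",
            "+trainer.start_select_epoch=1",  # illustrative value
        ],
    )
print(OmegaConf.to_yaml(cfg.trainer))
```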
The code for noisy-label learning methods on classification tasks is still being organized and will be released later.
See eval_scripts/eval_best.sh for a batch evaluation example; set ROOT, MODEL_PATH, etc. to match your environment.
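For a quick single-example check outside the batch scripts, Math-Verify (which this project uses for math evaluation; see acknowledgments below) can be called directly. A minimal sketch; the example strings are illustrative:

```python
from math_verify import parse, verify

# Parse the gold answer and a model completion, then test equivalence
# (Math-Verify treats 1/2 and 0.5 as the same value).
gold = parse("\\boxed{\\frac{1}{2}}")
pred = parse("The probability is $0.5$.")
print(verify(gold, pred))  # -> True
```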
If you find our code useful, please give us a star ⭐ or cite us using:
```bibtex
@article{yang2026can,
  title={Can LLMs Learn to Reason Robustly under Noisy Supervision?},
  author={Yang, Shenzhi and Zhu, Guangcheng and Song, Bowen and Li, Sharon and Wang, Haobo and Zheng, Xing and Ma, Yingfan and Chen, Zhongqi and Wang, Weiqiang and Chen, Gang},
  journal={arXiv preprint arXiv:2604.03993},
  year={2026}
}
```

OLR builds upon LUFFY, veRL, and deepscaler, and uses vLLM for inference. We use Math-Verify for math reasoning evaluation. We thank the open-source community for datasets and backbones.
For questions, feedback, or collaboration opportunities, feel free to reach out:
- Shenzhi Yang: yangshenzhi@zju.edu.cn



