This repository contains experiments on reinforcement learning fine-tuning (RLFT) for code generation models. We compare Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) on the deepseek-coder-7b-instruct model to evaluate improvements in compilation, functional correctness, synthesis, reasoning, and code quality.
- Goal: Enhance code generation models with RL-based techniques to improve correctness and reasoning.
- Models:
  - deepseek-coder-7b-instruct
- RL Algorithms:
  - Proximal Policy Optimization (PPO)
  - Group Relative Policy Optimization (GRPO)
We integrate LoRA adapters for parameter-efficient fine-tuning and log training metrics to Weights & Biases (W&B).
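For illustration, a minimal sketch of attaching a rank-16 adapter with the peft library. The model id, target modules, alpha, and dropout are assumptions for the sketch, not taken from this repository:

```python
# Hypothetical LoRA setup sketch; not this repository's exact code.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Model id assumed; the README only names "deepseek-coder-7b-instruct".
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-7b-instruct-v1.5")

lora_config = LoraConfig(
    r=16,                                  # LoRA rank, from the table below
    lora_alpha=32,                         # assumed scaling factor
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    lora_dropout=0.05,                     # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # expect roughly 0.54% trainable parameters
```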
Key training hyperparameters:

| Hyperparameter | Value |
|---|---|
| LoRA Rank | 16 |
| Trainable Ratio | 0.54% |
| Clip Epsilon | 0.2 |
| Value Coefficient | 0.5 |
| Max Grad Norm | 0.5 |
| Entropy Coefficient | 0.01 |
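Clip Epsilon, Value Coefficient, and Entropy Coefficient are the standard terms of the PPO objective. A minimal standalone sketch of how they combine, using the values from the table above (an assumed shape, not this repository's training loop):

```python
import torch
import torch.nn.functional as F

def ppo_loss(logp_new, logp_old, advantages, values, returns, entropy,
             clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped PPO objective with the coefficients from the table above."""
    ratio = torch.exp(logp_new - logp_old)                 # policy probability ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()    # clipped surrogate
    value_loss = F.mse_loss(values, returns)               # scaled by Value Coefficient
    entropy_bonus = entropy.mean()                         # scaled by Entropy Coefficient
    return policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

# Max Grad Norm (0.5) would be applied separately after backpropagation:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
```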
Training results are tracked in Weights & Biases.
- Mean Rewards: PPO shows higher variance but maintains stronger positive signals compared to GRPO, which stabilizes near zero.
- Compilation Rates: PPO maintains stability, while GRPO degrades after ~100 epochs.
- Functional Correctness: PPO sustains higher scores, while GRPO trends downward.
- Reasoning Ability: PPO improves modestly, while GRPO declines consistently.
Summary: PPO outperforms GRPO across compilation, functional correctness, and reasoning, though with higher variance. GRPO's strict reward shaping limits exploration and long-term stability.
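For intuition on GRPO's reward shaping: GRPO replaces PPO's learned value baseline with advantages normalized within each group of completions sampled for the same prompt. A minimal sketch of that computation (a standard formulation, not this repository's implementation):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages for one prompt's group of sampled completions.

    rewards: shape (group_size,), one scalar reward per completion.
    The result is mean-centered and std-normalized, so advantages within a
    group always sum to approximately zero.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four completions sampled for the same prompt.
adv = group_relative_advantages(torch.tensor([1.0, 0.2, -0.5, 0.3]))
```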
```bash
# Clone the project
git clone https://github.com/noobsiecoder/VeriGenLLM-v2.git
cd VeriGenLLM-v2

# To use PPO and GRPO RLFT
git switch ppo-v0

# To edit code: create a new branch and submit a pull request
git checkout -b <new-branch-name>
```

```bash
# This project uses the uv package manager.
# Installation: https://docs.astral.sh/uv/getting-started/installation/
# After installing, sync the project with all modules:
uv sync
```

```bash
# To run PPO: open constants.py and change the algorithm in the
# RLFT_TRAIN_CONFIG.rl_algorithm dict (currently set to GRPO).
# Then start the script:
uv run main.py
```
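The switch in constants.py might look like this (illustrative only; the surrounding keys and exact values of RLFT_TRAIN_CONFIG are assumptions):

```python
# constants.py (illustrative sketch; only rl_algorithm is named in this README)
RLFT_TRAIN_CONFIG = {
    "rl_algorithm": "PPO",  # change from "GRPO" to "PPO" to switch algorithms
    # ... other training settings ...
}
```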
All metrics and plots are logged automatically to your Weights & Biases workspace.
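Under the hood, that logging typically amounts to calls like the following; the project name and metric keys here are assumptions, and the values are placeholders:

```python
import wandb

# Project and metric names are hypothetical, not taken from this repository.
wandb.init(project="VeriGenLLM-v2", config={"rl_algorithm": "PPO"})
wandb.log({
    "mean_reward": 0.42,            # placeholder value
    "compilation_rate": 0.87,       # placeholder value
    "functional_correctness": 0.61, # placeholder value
})
wandb.finish()
```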
The reward function combines multiple criteria into a single scalar signal, covering compilation success, functional correctness, synthesizability, reasoning quality, and overall code quality.
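The exact weighting is defined in the repository; a plausible shape, assuming a simple weighted sum with unspecified weights $w_i$:

$$
R = w_{\text{comp}}\, r_{\text{comp}} + w_{\text{func}}\, r_{\text{func}} + w_{\text{synth}}\, r_{\text{synth}} + w_{\text{reason}}\, r_{\text{reason}} + w_{\text{qual}}\, r_{\text{qual}}
$$

where each $r$ term scores one criterion (e.g., $r_{\text{comp}} = 1$ if the generated code compiles, else a penalty) and the weights $w_i$ balance their contributions.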
All training runs are available in Weights & Biases. Example comparison plots:
- Mean Reward
- Compilation Rates
- Functional Correctness Rates
- Reasoning Rates
Pull requests are welcome! Please open an issue first to discuss proposed changes.
This project is licensed under the MIT License.