
Conversation

@Jacob-Chmura
Member

No description provided.

--dataset_name $dataset_name \
--sft_model_path Qwen/Qwen2-0.5B-Instruct \
--value_model_path Shahradmz/Qwen2-0.5B-Instruct_${dataset_name}_REWARD_0 \
--reward_model_path Shahradmz/Qwen2-0.5B-Instruct_${dataset_name}_REWARD \
Collaborator

Suggested change
-    --reward_model_path Shahradmz/Qwen2-0.5B-Instruct_${dataset_name}_REWARD \
+    --reward_model_path Shahradmz/Qwen2-0.5B-Instruct_CPPO_REWARD \

@EMZEDI
Collaborator

EMZEDI commented May 18, 2025

BUG: We need to pass the tokenizer when doing data loading for DPO, PPO, CPPO, and the other benchmarks; otherwise the chat template will not be applied to the data.
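For context, a minimal sketch of what threading the tokenizer through the data-loading step could look like. The loader name, the `prompt` column, and the dataset schema below are assumptions for illustration, not the repository's actual code:

```python
from datasets import load_dataset
from transformers import AutoTokenizer


def build_prompt_dataset(dataset_name: str, tokenizer):
    """Hypothetical loader: format raw prompts with the model's chat template."""
    dataset = load_dataset(dataset_name, split="train")

    def apply_template(example):
        # Without the tokenizer, the raw text would be handed to the trainer
        # and the model-specific chat template would never be applied.
        messages = [{"role": "user", "content": example["prompt"]}]
        example["prompt"] = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        return example

    return dataset.map(apply_template)


tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
```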

Shahrad Mohammadzadeh and others added 4 commits May 20, 2025 09:34
Reward hacking can be mitigated by increasing the KL coefficient, but that means the reward model needs to be retrained after increasing the divergence between the chosen and rejected responses (see the sketch after the commit list).
* Add ppo ewc jobs

* Update benchmark entry points

* Update trainers

* Typo in CPPO job

* Update jobs output dir
* Fix dpo job reward model

* Add dpo ewc jobs

* Rename
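
For reference, the KL coefficient mentioned above is the weight on the KL penalty in the PPO-style RLHF reward. This is a generic sketch of that computation, not the trainers' actual code; the tensor names and shapes are assumptions:

```python
import torch


def kl_penalized_rewards(reward_scores, logprobs, ref_logprobs, kl_coef):
    """Combine reward-model scores with a per-token KL penalty.

    reward_scores: (batch,) scalar reward-model score per response
    logprobs / ref_logprobs: (batch, seq_len) token log-probs under the policy
        and the frozen reference model
    kl_coef: the KL coefficient; raising it pulls the policy back toward the
        reference model and curbs reward hacking, at the cost of optimizing
        the reward model's score less aggressively.
    """
    kl = logprobs - ref_logprobs          # per-token KL estimate
    rewards = -kl_coef * kl               # penalty applied at every token
    rewards[:, -1] += reward_scores       # reward-model score on the final token
    return rewards
```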
@Jacob-Chmura changed the title from "Add cppo jobs" to "Experiment Overhaul" on May 20, 2025
ContinualPPOTrainer.ds_wrapped_models = self.model
else:
self.ds_wrapped_models = self.model
elif False:
Member Author

I can go ahead and drop these conditionals throughout this PR to simplify all our benchmarking code
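
As an illustration of what dropping the conditionals could look like, a minimal sketch; the class layout and method name below are assumptions based only on the lines quoted above:

```python
class ContinualPPOTrainer:
    def _wrap_models(self):
        # Instead of branching on dead conditions (e.g. `elif False:`) and
        # mixing class-level with instance-level attributes, always store the
        # wrapped model on the instance.
        self.ds_wrapped_models = self.model
```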
