
Conversation

@Jacob-Chmura
Member

No description provided.

--dataset_name $dataset_name \
--sft_model_path Qwen/Qwen2-0.5B-Instruct \
--value_model_path Shahradmz/Qwen2-0.5B-Instruct_${dataset_name}_REWARD_0 \
--reward_model_path Shahradmz/Qwen2-0.5B-Instruct_${dataset_name}_REWARD \
Collaborator

Suggested change
-    --reward_model_path Shahradmz/Qwen2-0.5B-Instruct_${dataset_name}_REWARD \
+    --reward_model_path Shahradmz/Qwen2-0.5B-Instruct_CPPO_REWARD \

@EMZEDI
Collaborator

EMZEDI commented May 18, 2025

BUG: We need to pass the tokenizer when doing data loading for DPO, PPO, CPPO, and the other benchmarks; otherwise the chat template will not be applied to the data.
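For context, a minimal sketch of what threading the tokenizer through the data-loading step could look like. The loader name, the `prompt` column, and the dataset schema below are assumptions for illustration, not the repository's actual code:

```python
from datasets import load_dataset
from transformers import AutoTokenizer


def build_prompt_dataset(dataset_name: str, tokenizer):
    """Hypothetical loader: format raw prompts with the model's chat template."""
    dataset = load_dataset(dataset_name, split="train")

    def apply_template(example):
        # Without the tokenizer, the raw text would be handed to the trainer
        # and the model-specific chat template would never be applied.
        messages = [{"role": "user", "content": example["prompt"]}]
        example["prompt"] = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        return example

    return dataset.map(apply_template)


tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
```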

Shahrad Mohammadzadeh and others added 4 commits May 20, 2025 09:34
Reward hacking can be mitigated by increasing the KL coefficient, but that means the reward model needs to be retrained after increasing the divergence between the chosen and rejected responses (see the sketch after the commit list).
* Add ppo ewc jobs

* Update benchmark entry points

* Update trainers

* Typo in CPPO job

* Update jobs output dir
* Fix dpo job reward model

* Add dpo ewc jobs

* Rename
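
For reference, the KL coefficient mentioned above is the weight on the KL penalty in the PPO-style RLHF reward. This is a generic sketch of that computation, not the trainers' actual code; the tensor names and shapes are assumptions:

```python
import torch


def kl_penalized_rewards(reward_scores, logprobs, ref_logprobs, kl_coef):
    """Combine reward-model scores with a per-token KL penalty.

    reward_scores: (batch,) scalar reward-model score per response
    logprobs / ref_logprobs: (batch, seq_len) token log-probs under the policy
        and the frozen reference model
    kl_coef: the KL coefficient; raising it pulls the policy back toward the
        reference model and curbs reward hacking, at the cost of optimizing
        the reward model's score less aggressively.
    """
    kl = logprobs - ref_logprobs          # per-token KL estimate
    rewards = -kl_coef * kl               # penalty applied at every token
    rewards[:, -1] += reward_scores       # reward-model score on the final token
    return rewards
```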
@Jacob-Chmura changed the title from "Add cppo jobs" to "Experiment Overhaul" on May 20, 2025
ContinualPPOTrainer.ds_wrapped_models = self.model
else:
self.ds_wrapped_models = self.model
elif False:
Member Author

I can go ahead and drop these conditionals throughout this PR to simplify all our benchmarking code
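
As an illustration of what dropping the conditionals could look like, a minimal sketch; the class layout and method name below are assumptions based only on the lines quoted above:

```python
class ContinualPPOTrainer:
    def _wrap_models(self):
        # Instead of branching on dead conditions (e.g. `elif False:`) and
        # mixing class-level with instance-level attributes, always store the
        # wrapped model on the instance.
        self.ds_wrapped_models = self.model
```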
