Experiment Overhaul #198
base: testing
Conversation
jobs/ccpo/ccpo_ccpo_multi_gpu.sh
Outdated
    --dataset_name $dataset_name \
    --sft_model_path Qwen/Qwen2-0.5B-Instruct \
    --value_model_path Shahradmz/Qwen2-0.5B-Instruct_${dataset_name}_REWARD_0 \
    --reward_model_path Shahradmz/Qwen2-0.5B-Instruct_${dataset_name}_REWARD \
Suggested change:
-    --reward_model_path Shahradmz/Qwen2-0.5B-Instruct_${dataset_name}_REWARD \
+    --reward_model_path Shahradmz/Qwen2-0.5B-Instruct_CPPO_REWARD \
BUG: We need to pass the tokenizer when doing data loading for DPO, PPO, CPPO, and the other benchmarks; otherwise the chat template is not applied to the data...
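For reference, a minimal sketch of what passing the tokenizer into data loading could look like. This is not the repo's actual loader; the helper name and the `prompt` column are assumptions for illustration only.

```python
from datasets import load_dataset
from transformers import AutoTokenizer


def load_prompts_with_chat_template(dataset_name: str, tokenizer):
    """Map raw prompts through the tokenizer's chat template (hypothetical helper)."""
    dataset = load_dataset(dataset_name, split="train")

    def apply_template(example):
        # Without the tokenizer here, the benchmarks would train on raw strings
        # and the instruct model never sees its expected chat formatting.
        example["prompt"] = tokenizer.apply_chat_template(
            [{"role": "user", "content": example["prompt"]}],
            tokenize=False,
            add_generation_prompt=True,
        )
        return example

    return dataset.map(apply_template)


tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
# train_dataset = load_prompts_with_chat_template(dataset_name, tokenizer)
```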
Reward hacking can be mitigated by increasing the KL coefficient, but that means the reward model needs to be retrained after increasing the divergence between the chosen and rejected responses.
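As a rough illustration of that trade-off (not code from this PR), PPO-style shaping typically subtracts a KL term from the scalar reward, so raising the coefficient pulls the policy back toward the reference model. All names and shapes below are illustrative assumptions.

```python
def kl_penalized_rewards(reward_scores, policy_logprobs, ref_logprobs, kl_coef=0.2):
    """Sketch of KL-penalized rewards.

    reward_scores:   tensor of shape (batch,) from the reward model
    policy_logprobs: tensor of shape (batch, seq_len) from the trained policy
    ref_logprobs:    tensor of shape (batch, seq_len) from the frozen reference
    """
    # Per-token log-ratio between the trained policy and the frozen reference.
    kl = policy_logprobs - ref_logprobs
    # A larger kl_coef penalizes drifting from the reference more heavily,
    # which limits reward-model exploits but also shrinks the usable reward signal.
    return reward_scores - kl_coef * kl.sum(dim=-1)
```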
* Add ppo ewc jobs
* Update benchmark entry points
* Update trainers
* Typo in CPPO job
* Update jobs output dir
* Fix dpo job reward model
* Add dpo ewc jobs
* Rename
    ContinualPPOTrainer.ds_wrapped_models = self.model
else:
    self.ds_wrapped_models = self.model
elif False:
I can go ahead and drop these conditionals throughout this PR to simplify all our benchmarking code.
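A sketch of that cleanup, assuming the guard around the lines shown in the diff context (the unreachable `elif False:` branch simply disappears); the `is_deepspeed_enabled` check is an assumption about the surrounding trainer code, not taken from the diff.

```python
# Sketch only: keep the two live assignments and drop the dead `elif False:` branch.
if self.is_deepspeed_enabled:
    ContinualPPOTrainer.ds_wrapped_models = self.model
else:
    self.ds_wrapped_models = self.model
```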
No description provided.