Force-pushed the branch from 9c57623 to f57ae81.
Force-pushed the branch from f57ae81 to deb71c1.
The wandb run: https://wandb.ai/sharma-sandeepch/trlx/runs/f7ym4m9y?workspace=user-sharma-sandeepch (updated the link to point to a run with a larger batch size)
@PhungVanDuy @maxreciprocate I cannot seem to request a review (probably due to permission issues / first contribution reasons). Could you please advise?
Thank you so much for the great PR. We will review this PR asap!
Thank you so much @PhungVanDuy for reviewing 🙏 Yes, it's the same wandb run I shared above. Here you go: https://wandb.ai/sharma-sandeepch/trlx/runs/f7ym4m9y?workspace=user-sharma-sandeepch
```python
from_fn = AutoModelForCausalLM.from_pretrained
if issubclass(type(config.model.model_path), PretrainedConfig):
    from_fn = AutoModelForCausalLM.from_config
```
AutoModelForSeq2SeqLM support is missing here
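For illustration, a minimal sketch of how the Seq2Seq case could be handled alongside the causal branch quoted above. The `model_arch_type` flag and the `getattr` default are assumptions for this sketch (they are not part of the quoted diff), and `config` refers to the surrounding trainer config object as in the snippet above:

```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, PretrainedConfig

# Assumed flag (hypothetical here): "seq2seq" selects encoder-decoder models.
if getattr(config.model, "model_arch_type", "causal") == "seq2seq":
    auto_cls = AutoModelForSeq2SeqLM
else:
    auto_cls = AutoModelForCausalLM

from_fn = auto_cls.from_pretrained
# If model_path is already a PretrainedConfig, build the model from the config
# instead of loading pretrained weights.
if isinstance(config.model.model_path, PretrainedConfig):
    from_fn = auto_cls.from_config
```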
Any update?
@PhungVanDuy were you able to review this?
I saw your wandb run, but the charts look quite messy; reward/accuracies and reward/margin don't clearly increase. I guess that's because you used gpt2 instead of a model SFT'd on HH for DPO. Can you use this SFT model and this preference dataset to train with this branch?
That's indeed what I did, for quick iteration and because I was limited on compute. I will run it with Mistral-7B on the UltraFeedback dataset and get back ASAP.
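As context for the reward/accuracies and reward/margin charts discussed above, here is a hedged sketch of how such metrics are commonly derived from the per-example DPO implicit rewards; the function name and metric keys are illustrative and not necessarily the exact ones logged by this branch:

```python
import torch

def dpo_reward_metrics(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> dict:
    # Fraction of pairs where the chosen completion receives the higher implicit reward.
    accuracies = (chosen_rewards > rejected_rewards).float().mean()
    # Average gap between chosen and rejected implicit rewards.
    margins = (chosen_rewards - rejected_rewards).mean()
    return {"reward/accuracies": accuracies.item(), "reward/margin": margins.item()}
```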
Force-pushed the branch from 6404f83 to 506fbbd.
Force-pushed the branch from 506fbbd to 6d63004.
@PhungVanDuy sorry for the delay, the GPUs aren't always available. Here is a DPO run (ongoing) of 1 epoch. Note:
Thank you for the information, I will use SFT-beta to check this. Let me help you run it on my cluster.
@sandeepchittilla can you add me on Discord with the handle:
I'm excited about DPO support and I hope it'll be added soon!
Closes #504
This PR adds Direct Preference Optimization (DPO) as introduced in https://arxiv.org/abs/2305.18290.
The loss calculation and concatenated forward pass implementations are adapted from the original TRL library.
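For reference, a minimal sketch of the DPO loss in the form used by TRL's sigmoid variant (Eq. 7 of the paper); the function name and signature here are illustrative, not necessarily the ones used in this PR:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
):
    """Per-example DPO loss given summed log-probs of the chosen/rejected
    completions under the policy and a frozen reference model."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid(beta * (policy log-ratio minus reference log-ratio))
    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    # Implicit "rewards" logged for monitoring (e.g. reward/accuracies, reward/margin).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps).detach()
    return losses.mean(), chosen_rewards, rejected_rewards
```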