Force-pushed the branch from 9c57623 to f57ae81.
Force-pushed the branch from f57ae81 to deb71c1.
The wandb run: https://wandb.ai/sharma-sandeepch/trlx/runs/f7ym4m9y?workspace=user-sharma-sandeepch (updated the link to point to a run with a larger batch size)
@PhungVanDuy @maxreciprocate I cannot seem to request a review (probably due to permission issues / first contribution reasons). Could you please advise?
Thank you so much for the great PR. We will review this PR asap!
Thank you so much @PhungVanDuy for reviewing 🙏 Yes, it's the same wandb run I shared above. Here you go: https://wandb.ai/sharma-sandeepch/trlx/runs/f7ym4m9y?workspace=user-sharma-sandeepch
```python
from_fn = AutoModelForCausalLM.from_pretrained
if issubclass(type(config.model.model_path), PretrainedConfig):
    from_fn = AutoModelForCausalLM.from_config
```
AutoModelForSeq2SeqLM support is missing here
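For illustration, a minimal sketch of how the Seq2Seq case could be handled alongside the causal branch quoted above. The `model_arch_type` flag and the `getattr` default are assumptions for this sketch (they are not part of the quoted diff), and `config` refers to the surrounding trainer config object as in the snippet above:

```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, PretrainedConfig

# Assumed flag (hypothetical here): "seq2seq" selects encoder-decoder models.
if getattr(config.model, "model_arch_type", "causal") == "seq2seq":
    auto_cls = AutoModelForSeq2SeqLM
else:
    auto_cls = AutoModelForCausalLM

from_fn = auto_cls.from_pretrained
# If model_path is already a PretrainedConfig, build the model from the config
# instead of loading pretrained weights.
if isinstance(config.model.model_path, PretrainedConfig):
    from_fn = auto_cls.from_config
```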
Any update?
@PhungVanDuy were you able to review this?
I saw your wandb run, but the charts look quite messy; reward/accuracies and reward/margin don't clearly increase. I guess that's because you used gpt2 instead of a model SFT'd on HH for DPO. Can you use this SFT model and this preference dataset to train with this branch?
That's indeed what I did, for quick iteration and because I was limited on compute. I will run it with Mistral-7B on the UltraFeedback dataset and get back ASAP.
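As context for the reward/accuracies and reward/margin charts discussed above, here is a hedged sketch of how such metrics are commonly derived from the per-example DPO implicit rewards; the function name and metric keys are illustrative and not necessarily the exact ones logged by this branch:

```python
import torch

def dpo_reward_metrics(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> dict:
    # Fraction of pairs where the chosen completion receives the higher implicit reward.
    accuracies = (chosen_rewards > rejected_rewards).float().mean()
    # Average gap between chosen and rejected implicit rewards.
    margins = (chosen_rewards - rejected_rewards).mean()
    return {"reward/accuracies": accuracies.item(), "reward/margin": margins.item()}
```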
Force-pushed the branch from 6404f83 to 506fbbd.
Force-pushed the branch from 506fbbd to 6d63004.
@PhungVanDuy sorry for the delay, the GPUs aren't always available. Here is a DPO run (ongoing) of 1 epoch. Note:
Thank you for the information, I will use SFT-beta to check this. Let me help you run it on my cluster.
@sandeepchittilla can you add me on Discord with the handle:
I'm excited about DPO support and I hope it'll be added soon!
Closes #504
This PR adds Direct Preference Optimization (DPO) as introduced in https://arxiv.org/abs/2305.18290.
The loss calculation and concatenated forward pass implementations are adapted from the original TRL library.
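For reference, a minimal sketch of the DPO loss in the form used by TRL's sigmoid variant (Eq. 7 of the paper); the function name and signature here are illustrative, not necessarily the ones used in this PR:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
):
    """Per-example DPO loss given summed log-probs of the chosen/rejected
    completions under the policy and a frozen reference model."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid(beta * (policy log-ratio minus reference log-ratio))
    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    # Implicit "rewards" logged for monitoring (e.g. reward/accuracies, reward/margin).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps).detach()
    return losses.mean(), chosen_rewards, rejected_rewards
```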