Open
Description
From discussions with OAI engineers, we've heard that a fairly simple method performs about as well as RL. The policy generates n samples, the reward model ranks them, and the policy is then finetuned on the top k ranked samples. We call this method Babble and Prune.
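
A minimal sketch of the sample-and-select step, to make the proposal concrete. Everything named here is an assumption for illustration: `babble_and_prune`, `generate`, and `reward` are hypothetical names, and the toy policy and reward model are stand-ins, not the real ones. The finetuning step itself is left out; the function only builds the data the policy would be finetuned on.

```python
import random
from typing import Callable, List, Tuple


def babble_and_prune(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: samples one completion from the policy
    reward: Callable[[str, str], float],  # hypothetical: reward model score for (prompt, completion)
    n: int,
    k: int,
) -> List[Tuple[str, str]]:
    """Sample n completions, rank them by reward, and keep the top k."""
    if not 0 < k <= n:
        raise ValueError("expected 0 < k <= n")
    # Babble: draw n samples from the policy.
    samples = [generate(prompt) for _ in range(n)]
    # Score each sample once with the reward model.
    scored = [(reward(prompt, s), s) for s in samples]
    # Rank from highest to lowest reward (stable sort, so ties keep sampling order).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Prune: keep the top k as (prompt, completion) pairs for finetuning.
    return [(prompt, s) for _, s in scored[:k]]


# Toy stand-ins (assumptions, not the real policy or reward model).
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return " ".join(rng.choice(["good", "bad", "ok"]) for _ in range(4))


def toy_reward(prompt: str, completion: str) -> float:
    return float(completion.split().count("good"))


finetune_data = babble_and_prune("Say something:", toy_generate, toy_reward, n=8, k=2)
```

In a real run, `finetune_data` would be collected over many prompts and passed to an ordinary supervised finetuning loop; how n and k should be chosen is not specified above.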