Babble and Prune

From discussions with OAI engineers, we've heard that a pretty simple method performs similarly well to RL. Basically, we let the policy generate n samples, rank these samples using the reward model, and then finetune the policy on the top k ranked samples. We call this method Babble and Prune.