reward_fn in accelerate_ppo_trainer.py

Hi, may I know why these two descriptions of reward_fn differ, even though they seem to refer to the same function that is passed to the PPO trainer as input? In my understanding of PPO, the reward function should output one scalar reward per sample, not a sequence of per-token rewards of shape (sequence_length,) for each sample.

https://github.com/CarperAI/trlx/blob/3340c2f3a56d1d14fdd5f13ad575121fa26b6d92/trlx/trainer/accelerate_ppo_trainer.py#L309-L310

https://github.com/CarperAI/trlx/blob/3340c2f3a56d1d14fdd5f13ad575121fa26b6d92/trlx/trlx.py#L38-L40


reward_fn in accelerate_ppo_trainer.py #602


	# reward_fn should return list of rewards at each token per sample
	# NOTE: all_scores[0][i] is the reward due to token (action) i in prompt + response (b/c of how kl is computed)

	reward_fn (`Optional[Callable[[List[str], List[str], List[str]], List[float]]]`):
	A function to rate batches of generated samples. Its required arguments are
	(`samples`, `prompts`, `outputs`) and the return is a list of scalar rewards per each sample in batch
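To make the two return shapes concrete, here is a minimal sketch of how a scalar per-sample reward and a per-token reward list could be brought to the same per-token form before a KL penalty is applied. It is only an illustration under assumptions and not the trainer's actual code: the helper names `to_token_rewards` and `add_kl_penalty`, and the choice to credit a scalar reward to the last response token, are hypothetical.

```python
from typing import List, Sequence, Union

# One sample's reward: either a single scalar or one value per response token.
Reward = Union[float, Sequence[float]]


def to_token_rewards(reward: Reward, response_len: int) -> List[float]:
    """Expand one sample's reward into a list of length `response_len`.

    Hypothetical helper: a scalar reward is assumed to be credited to the
    final response token, while a per-token list is used as given.
    """
    if isinstance(reward, (int, float)):
        dense = [0.0] * response_len
        if response_len > 0:
            dense[-1] = float(reward)
        return dense

    dense = [float(r) for r in reward]
    if len(dense) != response_len:
        raise ValueError(
            f"expected {response_len} per-token rewards, got {len(dense)}"
        )
    return dense


def add_kl_penalty(
    token_rewards: Sequence[float],
    kl_per_token: Sequence[float],
    kl_coef: float,
) -> List[float]:
    """Subtract a scaled per-token KL term from the per-token rewards."""
    return [r - kl_coef * kl for r, kl in zip(token_rewards, kl_per_token)]


# Scalar reward for a 3-token response: only the last token carries it.
sparse = to_token_rewards(1.0, 3)

# Per-token rewards for the same response are passed through unchanged.
dense = to_token_rewards([0.1, 0.2, 0.3], 3)

# Either form can then be combined with a per-token KL penalty.
shaped = add_kl_penalty(sparse, [0.5, 0.5, 0.5], kl_coef=0.1)
```

Under this reading, the two comments would not contradict each other: a function returning one scalar per sample is the sparse case, and a function returning a list per sample is the dense case. Whether the trainer at this commit handles both this way is exactly what the question asks.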
