I am running ./GPG/open-r1/train.sh.
Training runs fine, but during evaluation (with --eval_strategy steps set), the tensor dimensions in the following line fail to align:
per_token_loss = - per_token_logps * advantages.unsqueeze(1)
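For context, here is a minimal sketch (my own, not from the repo) of the shapes this line expects: per_token_logps is [batch, seq_len], advantages is [batch], and advantages.unsqueeze(1) becomes [batch, 1], which broadcasts across the sequence dimension. Broadcasting only works when the batch dimensions match.

import torch

per_token_logps = torch.randn(16, 512)  # [batch=16, seq_len=512]
advantages = torch.randn(16)            # [batch=16]
per_token_loss = - per_token_logps * advantages.unsqueeze(1)  # [16, 1] broadcasts -> [16, 512]
print(per_token_loss.shape)             # torch.Size([16, 512])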
After debugging, I found the problem comes from the following block inside _generate_and_score_completions (call it Snippet-1):
if n_valid_samples < self.args.min_inverse_alpha * num_samples:
    logger.info(f"keep generating more examples: the {n_gen}-th mini-batch")
    n_gen += 1
else:
    # Reassemble the sample batch
    rewards = merge(identical_rewards, new_rewards)[:len(prompts)]
    print(f"[DEBUG][RANK {self.accelerator.process_index}] lin999 {mode} rewards.shape:{rewards.shape}, len(prompts):{len(prompts)}")
    prompt_ids = merge_with_padding(identical_prompt_ids, new_prompt_ids, self.processing_class.pad_token_id, left_pad=True)[:len(prompts)]
    prompt_mask = merge_with_padding(identical_prompt_mask, new_prompt_mask, 0, left_pad=True)[:len(prompts)]
    completion_ids = merge_with_padding(identical_completion_ids, new_completion_ids, self.processing_class.pad_token_id, left_pad=False)[:len(prompts)]
    completion_mask = merge_with_padding(identical_completion_mask, new_completion_mask, 0, left_pad=False)[:len(prompts)]
    break
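For reference, my reading of merge_with_padding, inferred only from its call signature here (a hypothetical sketch, not the repo's actual implementation): it pads two [batch, seq] tensors to a common sequence length (on the left for prompts, on the right for completions) and concatenates them along the batch dimension.

import torch
import torch.nn.functional as F

def merge_with_padding(a, b, pad_value, left_pad):
    # Hypothetical sketch: pad both [batch, seq] tensors to the longer
    # sequence length, then stack the two batches.
    target_len = max(a.size(1), b.size(1))
    def pad(t):
        extra = target_len - t.size(1)
        # F.pad pads the last dimension with (left, right) amounts.
        return F.pad(t, (extra, 0) if left_pad else (0, extra), value=pad_value)
    return torch.cat([pad(a), pad(b)], dim=0)

Note that the output batch size is len(a) + len(b), which is presumably why Snippet-1 slices with [:len(prompts)] afterwards.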
At the first evaluate step, the else branch is taken and everything runs normally. At the second evaluate step, however, the if branch is taken, and the dimension of rewards ends up different from the first time.
For example, when I train on 4 GPUs, the hyperparameters are as follows:
[INFO|trainer.py:2414] 2025-07-01 11:55:15,632 >> ***** Running training *****
[INFO|trainer.py:2417] 2025-07-01 11:55:15,633 >> Instantaneous batch size per device = 16
[INFO|trainer.py:2420] 2025-07-01 11:55:15,633 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:2421] 2025-07-01 11:55:15,633 >> Gradient Accumulation steps = 2
After Snippet-1 finishes at the first evaluate step, rewards has shape [16], but after the second evaluate step it has shape [64] (which happens to be 16 × 4, the per-device batch size times the number of GPUs).
This causes a dimension-mismatch error when computing per_token_loss = - per_token_logps * advantages.unsqueeze(1), because the first dimension of per_token_logps is always 16.
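The failure is easy to reproduce in isolation. A minimal sketch with the shapes from the logs above (the exact error text may differ across PyTorch versions):

import torch

per_token_logps = torch.randn(16, 512)  # first dim stays 16 on each device
advantages = torch.randn(64)            # 64 = 16 per-device batch x 4 GPUs
# RuntimeError: broadcasting [16, 512] with [64, 1] fails at dimension 0
# ("The size of tensor a (16) must match the size of tensor b (64) ...")
per_token_loss = - per_token_logps * advantages.unsqueeze(1)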