There's a bug in the GRPO advantage calculation at line 159 of verl/trainer/core_algos.py: the standard deviation computation wraps its input in an extra pair of brackets, which gives the resulting tensor the wrong shape.
Line 159:
id2std[idx] = torch.std(torch.tensor([id2score[idx]]))
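id2score[idx] is already a list, so wrapping it in additional brackets, [id2score[idx]], creates a nested list. torch.tensor then builds a 2-D tensor of shape (1, n) instead of a 1-D tensor of shape (n,), where n is the number of scores in the group. Dropping the outer brackets, torch.tensor(id2score[idx]), gives the 1-D shape.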
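The shape difference can be reproduced in isolation. This is a minimal sketch, not the code from core_algos.py: group_scores is a hypothetical stand-in for one id2score[idx] entry, and it is assumed here to be a plain list of floats.

```python
import torch

# Hypothetical stand-in for id2score[idx]: one group's scores as a plain list.
group_scores = [1.0, 2.0, 3.0]

# Extra brackets, as on line 159: the list becomes a list containing one list.
nested = torch.tensor([group_scores])
# Without the extra brackets: the list is converted directly.
flat = torch.tensor(group_scores)

print(nested.shape)  # torch.Size([1, 3])
print(flat.shape)    # torch.Size([3])

# torch.std with no dim argument reduces over every element of its input.
std_nested = torch.std(nested)
std_flat = torch.std(flat)
print(std_nested, std_flat)
```

With no dim argument, torch.std reduces over all elements, so both calls return the same scalar in this sketch. The shape difference starts to matter as soon as the reduction is taken along a specific dimension, which is one reason to prefer the 1-D form.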