Reward function lacks constraints on reasoning process, causing model to skip CoT

It seems the current reward signal lacks regularization on the chain of thought (CoT). Consequently, the model tends to skip the reasoning trace and output the solution directly, potentially resulting in reward hacking.