Hi, thanks for your excellent work on TTRL — it has been extremely helpful for our research!
I have two questions about the training procedure, and would really appreciate your clarification:
- **Does TTRL training stop when the reward converges?**
  In your experiments, do you monitor the reward curve and stop once it converges / plateaus, or do you train for a fixed number of epochs regardless of reward stabilization?
- **Should test data be used only once in TTRL?**
  Since TTRL (Test-Time RL) generates rollouts conditioned on a test dataset, is it recommended that each test sample be used exactly once, or is it acceptable for the same test sample to appear repeatedly across multiple epochs of training?
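To make the first question concrete, here is a minimal sketch of what I mean by "stop when the reward plateaus": halt training once the moving-average reward has not improved by more than some threshold for several consecutive checks. All names here (`RewardPlateauStopper`, `patience`, `min_delta`, `window`) are hypothetical and for illustration only; they are not from the TTRL codebase.

```python
# Hypothetical plateau-based early stopping for a reward curve.
# Not part of TTRL; purely illustrative of the question being asked.
from collections import deque


class RewardPlateauStopper:
    def __init__(self, patience=3, min_delta=0.01, window=5):
        self.patience = patience      # stale checks tolerated before stopping
        self.min_delta = min_delta    # minimum improvement to count as progress
        self.rewards = deque(maxlen=window)  # recent per-step mean rewards
        self.best = float("-inf")
        self.stale = 0

    def update(self, mean_reward):
        """Record one training step's mean reward; return True when training
        should stop because the smoothed reward has stopped improving."""
        self.rewards.append(mean_reward)
        avg = sum(self.rewards) / len(self.rewards)
        if avg > self.best + self.min_delta:
            self.best = avg
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```

Is something along these lines used in your experiments, or is the epoch count simply fixed in advance?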