Hi, thanks for your excellent work on TTRL — it has been extremely helpful for our research!
I have two questions about the training procedure, and would really appreciate your clarification:
- **Does TTRL training stop when the reward converges?**
  In your experiments, do you monitor the reward curve and stop once it converges / plateaus, or do you train for a fixed number of epochs regardless of reward stabilization?
- **Should test data be used only once in TTRL?**
  Since TTRL (Test-Time RL) generates rollouts conditioned on a test dataset, is it recommended that each test sample be used exactly once, or is it acceptable for the same test sample to appear repeatedly across multiple epochs of training?
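To make the first question concrete, here is a minimal sketch of what I mean by "stop when the reward plateaus": halt training once the moving-average reward has not improved by more than some threshold for several consecutive checks. All names here (`RewardPlateauStopper`, `patience`, `min_delta`, `window`) are hypothetical and for illustration only; they are not from the TTRL codebase.

```python
# Hypothetical plateau-based early stopping for a reward curve.
# Not part of TTRL; purely illustrative of the question being asked.
from collections import deque


class RewardPlateauStopper:
    def __init__(self, patience=3, min_delta=0.01, window=5):
        self.patience = patience      # stale checks tolerated before stopping
        self.min_delta = min_delta    # minimum improvement to count as progress
        self.rewards = deque(maxlen=window)  # recent per-step mean rewards
        self.best = float("-inf")
        self.stale = 0

    def update(self, mean_reward):
        """Record one training step's mean reward; return True when training
        should stop because the smoothed reward has stopped improving."""
        self.rewards.append(mean_reward)
        avg = sum(self.rewards) / len(self.rewards)
        if avg > self.best + self.min_delta:
            self.best = avg
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```

Is something along these lines used in your experiments, or is the epoch count simply fixed in advance?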