Question about TTRL training schedule: reward convergence & test data usage #48

@tls0523

Description

Hi, thanks for your excellent work on TTRL — it has been extremely helpful for our research!
I have two questions about the training procedure, and would really appreciate your clarification:

  1. Does TTRL training stop when the reward converges?
    In your experiments, do you monitor the reward curve and stop when it reaches convergence / plateau?
    Or do you train for a fixed number of epochs regardless of reward stabilization?

  2. Should test data be used only once in TTRL?
    Since TTRL (Test-Time RL) generates rollouts conditioned on a test dataset, is it recommended that each test sample be used exactly once, or is it acceptable for the same test sample to appear repeatedly across multiple epochs of training?
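To make the two questions concrete, here is a minimal sketch of the two scheduling choices being asked about. All names (`ttrl_loop`, `train_step`, `single_pass`, the plateau check) are hypothetical and do not come from the TTRL codebase; this is only an illustration of reward-plateau stopping (question 1) versus single-pass or repeated use of the test set (question 2).

```python
# Hypothetical sketch; none of these names are from the TTRL repo.
from collections import deque

def reward_has_plateaued(history, window=5, tol=1e-3):
    """Convergence check for question 1: True when the mean reward over
    the last `window` epochs differs from the previous `window`-epoch
    mean by less than `tol`."""
    if len(history) < 2 * window:
        return False
    vals = list(history)
    recent = sum(vals[-window:]) / window
    earlier = sum(vals[-2 * window:-window]) / window
    return abs(recent - earlier) < tol

def ttrl_loop(test_set, train_step, max_epochs=50, single_pass=False):
    """Question 2's two options: `single_pass=True` visits each test
    sample exactly once; otherwise the same samples repeat every epoch
    until the reward plateaus or `max_epochs` is reached.
    Returns the per-epoch mean-reward history."""
    history = deque(maxlen=100)
    for _ in range(1 if single_pass else max_epochs):
        epoch_reward = sum(train_step(x) for x in test_set) / len(test_set)
        history.append(epoch_reward)
        if reward_has_plateaued(history):
            break
    return list(history)
```

With a plateau check like this, multi-epoch training stops well before `max_epochs` once the reward curve stabilizes, which is one plausible reading of "stop when the reward converges."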
