@pjelement

Summary:
This diff implements the TRANSFORMER warmup policy from "Attention is All You Need" (Vaswani et al., 2017) for learning rate scheduling in torchrec, and updates a model configuration to use it.

Implementation

Added WarmupPolicy.TRANSFORMER to fbcode/torchrec/optim/warmup.py, which implements the formula:

lr = base_lr * min(step^(-0.5), step * warm_steps^(-1.5)) * lr_scale

This schedule provides:

  • Warmup phase: the LR increases linearly from near zero to its peak at warm_steps
  • Decay phase: the LR decays as the inverse square root of the step after warm_steps

The max_iters parameter serves as warm_steps in the formula. The schedule peaks at step = warm_steps, where the two terms inside the min() become equal.
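
For reference, here is a minimal standalone sketch of the schedule as stated above (an illustrative reimplementation of the formula, not the actual torchrec code):

```python
def transformer_lr(step: int, base_lr: float, warm_steps: int, lr_scale: float = 1.0) -> float:
    """Transformer LR schedule: linear warmup up to `warm_steps`, then
    inverse-square-root decay. Steps are assumed to start at 1."""
    step = max(step, 1)  # guard against step 0, which would raise on 0 ** -0.5
    return base_lr * min(step ** -0.5, step * warm_steps ** -1.5) * lr_scale
```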

Testing

Added comprehensive unit tests in fbcode/torchrec/optim/tests/test_warmup.py:

  • Formula correctness at key milestones (step 1, warmup completion, post-warmup); see the sketch after this list
  • Monotonic increase during warmup phase
  • Monotonic decrease during decay phase
  • Proper application of lr_scale multiplier
  • Integration tests with WarmupOptimizer
  • Uses none_throws() from pyre_extensions for type-safe Optional handling
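
As an illustration of the milestone checks, a small self-contained sketch (the helper restates the formula above; this is not the actual code in test_warmup.py):

```python
import unittest


def transformer_lr(step: int, base_lr: float, warm_steps: int, lr_scale: float = 1.0) -> float:
    # Restates the formula from the Implementation section above.
    return base_lr * min(step ** -0.5, step * warm_steps ** -1.5) * lr_scale


class TransformerScheduleMilestonesTest(unittest.TestCase):
    def test_milestones(self) -> None:
        warm_steps = 4_000
        # Step 1: the warmup term dominates, so the LR starts near zero.
        self.assertAlmostEqual(transformer_lr(1, 1.0, warm_steps), warm_steps ** -1.5)
        # Warmup completion: both min() terms coincide; the LR peaks at warm_steps^-0.5.
        self.assertAlmostEqual(transformer_lr(warm_steps, 1.0, warm_steps), warm_steps ** -0.5)
        # Post-warmup: the inverse-square-root term takes over and the LR decays.
        self.assertLess(
            transformer_lr(2 * warm_steps, 1.0, warm_steps),
            transformer_lr(warm_steps, 1.0, warm_steps),
        )


if __name__ == "__main__":
    unittest.main()
```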

Updated fbcode/torchrec/optim/tests/BUCK to include the pyre-extensions dependency.

Configuration Update

Updated fbcode/minimal_viable_ai/models/gysj/gysj_esr_roo/conf/model_roo_config.py to use TRANSFORMER warmup:

  • Changed both sparse and dense optimizers from LINEAR warmup to TRANSFORMER
  • Set warm_steps to 80,000 for both optimizers
  • Updated the optimizer hyperparameters (lr=0.001, eps=1e-6, beta values) to work with the TRANSFORMER schedule; a quick check of the resulting LR trajectory follows this list
  • Changed the sparse optimizer from ROWWISE_ADAGRAD to ADAM for consistency with the dense optimizer
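
Taking the stated formula at face value with these settings, the LR trajectory can be sanity-checked as follows (illustrative only; the values actually seen in training also depend on lr_scale and on how the config wires base_lr into the warmup stage):

```python
# Illustrative sanity check: evaluates the stated formula with the configured
# base_lr=0.001 and warm_steps=80,000, assuming lr_scale=1.0.
base_lr, warm_steps = 0.001, 80_000
for step in (1, 20_000, 80_000, 320_000):
    lr = base_lr * min(step ** -0.5, step * warm_steps ** -1.5)
    print(f"step={step:>7}: lr={lr:.3e}")
# step=      1: lr=4.419e-11
# step=  20000: lr=8.839e-07
# step=  80000: lr=3.536e-06  (peak at warm_steps)
# step= 320000: lr=1.768e-06
```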

Differential Revision: D87127589
